Toward a Personal AI Roadmap for VRM

June 18, 2025 / Doc Searls / 4 Comments

On the ProjectVRM list, John Wunderlich shared a find that makes clear how advanced and widespread AI-based shopping recommendation has become (and not just with ChatGPT and Amazon). Here it is: Envisioning Recommendations on an LLM-Based Agent Platform: Can LLM-based agents take recommender systems to the next level?

It’s by Jizhi Zhang, Keqin Bao, Wenjie Wang, Yang Zhang, Wentao Shi, Wanhong Xu, Fuli Feng, and Tat-Seng Chua* and is published in the Artificial Intelligence and Machine Learning section of Research and Advances in Communications of the ACM. So it’s serious stuff.

Here’s one graphic from the piece, with its caption:

Illustration of the Rec4Agentverse. The left side depicts three roles in the Rec4Agentverse: the user, the Agent Recommender, and Item Agents, along with their interconnected relationships. In contrast to traditional recommender systems, the Rec4Agentverse has more intimate relationships among the three roles. For instance, there are multi-round interactions between 1) users and Item Agents and 2) the Agent Recommender and Item Agents. The right side demonstrates how the Agent Recommender can collaborate with Item Agents to affect the information flow of users and offer personalized information services.

With VRM, however, we need our own personal AI agents, and useful data that’s not limited to what’s gathered by surveillance (which we can assume is most if not all of the Information layer).

So I started a conversation with ChatGPT. I’ll divide it into sections:

1. Who wrote the CACM piece
2. The state of recommendation science
3. Who is working on personal AI
4. How one can get started
5. Forms of personal data to manage

There’s a lot here. The meat of it, for ProjectVRM purposes, starts in section 3. (The first two are more about what we’re up against.) Please read the whole thing and help us think through where to go with the challenge facing us. As we see in section 3, we do have some stuff on our side.

1. Who wrote the CACM piece

Me:

Can you tell me more about the authors here: https://cacm.acm.org/research/envisioning-recommendations-on-an-llm-based-agent-platform/?

ChatGPT:

🧑‍💼 Jizhi Zhang

Affiliation: University of Science and Technology of China (USTC), Data Science Lab (inferred from USTC page).
Research Interests: Recommendation systems, LLM-based agent platforms, and interactive intelligent systems; his lead authorship and continued work on Prospect Personalized Recommendation on LLM-Based Agent Platform (arXiv Feb 2024) reinforce this.

🧑‍🔬 Keqin Bao, Wenjie Wang, Yang Zhang, Wentao Shi, Fuli Feng, Tat‑Seng Chua

Affiliations: While CACM doesn’t list this, Jizhi Zhang and Fuli Feng are part of USTC’s Data Science Lab; Tat‑Seng Chua is a distinguished researcher from the National University of Singapore known for multimedia and information retrieval.
Expertise: A mix of recommender system design, LLM-agent evaluation, multimedia analysis, and human-AI interaction, reflecting a blend of USTC’s strengths and CACM relevance.

🧑‍🎓 Wanhong Xu

Affiliation: USTC Data Science Lab (USTC LDS), contributor to the same article.
Role: Emerging researcher whose latest publication is this Rec4Agentverse article, indicating an early-career focus in LLM-based recommender systems.

🧠 Broader Context & Contributions

The group operates at the intersection of Large Language Models, agent-oriented systems, and recommender systems, pioneering a Rec4Agentverse paradigm that treats LLM-powered agents as “recommendable” items collaborating with one another, an evolution of traditional recommendation frameworks.
Their work continues earlier research (e.g., Prospect Personalized Recommendation… in Feb 2024), solidifying their ongoing commitment to expanding intelligent, conversational recommendation platforms.

🧭 Why This Matters

Academic Thought Leadership: Jizhi Zhang and colleagues are pushing forward thinking about how recommendations can evolve when agents—not passive items—are central.
Institutional Backing: All deeply connected to USTC’s Data Science Lab, underscoring strong Chinese academic investment in AI-driven recommender innovation.
Collaborative Reach: Including scholars like Tat‑Seng Chua broadens the research to include international perspectives in multimedia and retrieval systems.

In summary: This team is a multidisciplinary cohort—centered at USTC with international collaboration—who are pioneering the idea that LLM agents can become interactive, dynamic items in recommender systems. Their work positions them at the forefront of intelligent agent systems and could reshape how platforms personalize experiences in the future.

If you’re curious about any individual’s past publications, citation impact, or related work (e.g., on Agent4Rec), I’d be happy to dive deeper!

2. The state of recommendation science

Me:

How large is the emerging science, within the AI field, of recommendations?

ChatGPT:

The science of recommendations, particularly within the AI field, is not just emerging—it’s a massive, rapidly evolving subfield that is both academically rich and commercially dominant. That said, it’s also entering a new phase, driven by the rise of large language models (LLMs), interactive agents, and multimodal platforms.

📚 Academic Scope

1. Established Field: Recommender Systems

Recommender systems (RecSys) are a well-developed domain within AI and machine learning, with roots in the 1990s.
Annual conferences like:
ACM RecSys (since 2007)
SIGIR, KDD, WSDM, and NeurIPS regularly feature recommendation research.
Thousands of papers annually on topics like:
Collaborative filtering
Matrix/tensor factorization
Deep learning for recommendation
Fairness, transparency, and explainability in RecSys
Reinforcement learning and bandit algorithms for personalization

🔎 As of 2024, over 60,000 papers in academic databases (e.g., Google Scholar) are tagged or indexed under “recommender systems”.

📈 Current Trends & Expanding Frontiers

2. LLMs + Recommendation = New Subfield

A major shift is underway: integrating Large Language Models (LLMs) into recommendation systems.
These go beyond “you may also like” to:
Use natural language context and dialogue
Generate explanations, not just rankings
Support conversational recommendation
Coordinate agent ecosystems, e.g., multi-agent reasoning environments

Example paradigms:

RecMind – reasoning-aware recommendation agents
Rec4Agentverse – treating agents as recommendable entities
ChatRecs – integrating LLMs into live user conversations

3. Multimodal & Multitask Integration

Systems increasingly blend text, image, video, behavior, and user feedback.
LLMs allow richer personalization using complex user intent derived from unstructured data.

4. Industry Investment

FAANG companies (especially YouTube, Netflix, Amazon, TikTok, and Spotify) base their core user experience on AI-driven recommendation.
Billions of dollars are spent optimizing:
Ranking quality
Engagement prediction
Real-time inference
Federated and privacy-preserving personalization

🧭 Estimate of Scope

Area	Estimate (2024–25)
📄 Academic papers (RecSys + AI)	60,000+ overall; thousands per year
📚 Top conferences with recommender tracks	20+ (ACM RecSys, SIGIR, KDD, NeurIPS, etc.)
👩‍💼 Active researchers	Tens of thousands globally
🧠 Emerging subfields	LLM-based Rec, conversational Rec, agent-based Rec, explainable Rec
💰 Industry scale	$10B+ annually invested in personalization tech

🧠 In Summary

The science of recommendation is:

Not new, but rapidly evolving
Core to AI applications, especially with LLMs and agents
One of the most commercially and socially impactful domains in modern AI
Growing into new subfields (e.g. Rec4Agentverse, LLM Rec, multi-agent Rec)

Let me know if you’d like a visual map of how the recommendation science ecosystem is structured, or who the top authors and institutions are in this space.

3. Who is working on personal AI

Me:

And who, if anyone, is working to provide customers with their own personal AI tools: ones that work for them, to empower their own agency in the marketplace? By which I mean truly personal AI, and not the kind personalized by some external LLM (or similar) system.

ChatGPT:

That’s an essential and forward-looking question—especially as most “AI for you” today really means “AI about you, for someone else.”

While Big Tech firms dominate personalized AI, a growing ecosystem is focused on truly personal AI—that is, AI you own, that works for you (not advertisers or platforms), and that enhances your agency in the marketplace and your life.

Here are the most notable people, projects, and initiatives advancing truly personal AI today:

🧠 1. Kwaai and pAI-OS (South Africa / Global)

Kwaai is building an open-source, user-sovereign personal AI OS called pAI-OS.
Designed to run locally or at user discretion, with full data ownership, modular agent systems, and plugin capability.
Aligns with IEEE P7012 (“MyTerms”) privacy standards (see below).
Actively exploring intentcasting, customer signaling, and VRM (Vendor Relationship Management).
Site: https://kwaai.ai

🛠️ 2. Indie AI and Personal Agent projects

Several smaller open-source initiatives are trying to build personal agent toolkits that you can host and configure:

Examples:

AutoGen Studio (Microsoft) – allows building agent workflows, some of which can be user-defined. But it’s not yet fully personal or private.
Langroid – open-source framework for local agent-based LLM systems. Designed for extensibility, can be made personal with effort.
OpenDevin, Superagent, Auto-GPT – can be self-hosted, though most are still tool-centric rather than user-centric.
Personal.ai – A commercial startup that says it is personal, but uses centralized storage and fine-tuning, so actual user control is limited.

🔍 Reality check: Most of these still rely on external LLMs (like OpenAI or Anthropic), even if they claim to be “your AI”.

🌐 3. Solid Project (by Tim Berners-Lee)

While not AI per se, Solid is a decentralized identity and data storage architecture that allows individuals to store their own data in personal data pods.
AI tools can be layered on top to access only the data you choose to share.
Foundation for personal AI assistants that can work with your data, not on others’ clouds.
https://solidproject.org

📜 4. IEEE P7012 – MyTerms

A standard being developed to flip the notice-and-consent regime—so individuals, not companies, dictate the terms of data use.
Backed by Customer Commons and designed to support intentcasting and personal marketplace signaling.
This enables a personal AI to serve as your agent, enforcing your terms and engaging the marketplace on your behalf.

🧩 5. Project VRM (Vendor Relationship Management)

Initiated by Doc Searls to empower customers with tools to manage vendors, not just the reverse.
Envisions personal tools (including AI) to:
Signal intent
Manage service contracts
Retain records of offers, purchases, warranties
Avoid manipulation, surveillance, and lock-in

Project VRM is not a software project per se, but a framework guiding developers to build personal tools that support individual autonomy.

🔒 6. Open-source LLM stacks for local use

Several local LLM projects can be combined with personal data layers to build truly personal AI:

LM Studio – runs local models on desktop
PrivateGPT – local inference + local documents
Ollama – macOS-native interface for fast-running models like LLaMA, Mistral
LangChain / LlamaIndex – frameworks for building RAG systems with local documents

These aren’t personal AI systems yet—but they are key ingredients.

🧭 Summary

Name / Project	What They Do	Status
Kwaai / pAI-OS	Open, personal AI OS and agent platform	Early but active
Solid	Personal data control layer	Stable; in use
IEEE P7012	Standards for user-enforced data terms	In development
Project VRM	Framework for tools that empower buyers	Thought leadership; toolkits emerging
Local AI frameworks	Tech foundations for personal AI	Available, DIY-friendly

🚧 The Gap

Despite this promising landscape, no major platform today delivers a fully integrated, easy-to-use personal AI that you own, configure, and direct—yet. But:

A personal AI revolution is underway, led by privacy advocates, open-source communities, and a few visionary developers—not Big Tech.

Would you like help putting together a DIY stack for your own personal AI toolkit based on this landscape?

Me:

Can you name some of the people in the personal AI revolution, and tell me what that DIY stack would contain?

ChatGPT:

Absolutely. Here’s a breakdown of key people involved in the personal AI movement, followed by a practical DIY stack you can start building today.

👤 People Leading or Advancing the Personal AI Revolution

These individuals aren’t all working on the same projects, but they’re aligned in trying to build AI that’s owned, run, and controlled by the individual—not centralized platforms.

🧑‍🔬 Tim Berners-Lee

Project: Solid (stewarded by ODI, commercially implemented by companies including TBL’s Inrupt)
Focus: Decentralized web and personal data ownership via data pods
Impact: Infrastructure for privacy-respecting personal AI agents
https://solidproject.org

👤 Doc Searls

Project: Project VRM, Customer Commons, IEEE P7012
Focus: Customer empowerment, intentcasting, reversing surveillance capitalism
Impact: Philosophy and standards for customer-directed personal AI
https://projectvrm.org

🧠 Tawanda Kembo & Kwaai team

Project: pAI-OS (personal AI operating system)
Focus: Agent-centric open-source operating system for personal AI
Impact: One of the few efforts aiming to create a complete personal AI stack
https://kwaai.ai

🧑‍💻 Simon Willison

Project: Datasette, open tools for data management
Focus: Personal data control, lightweight databases, LLM + local tooling
Impact: Key figure in the open data + local AI tooling space
https://simonwillison.net

🤖 Brian Roemmele

Project: Multiplex (personal voice AI)
Focus: Voice-first personal assistants you run locally
Impact: Vocal advocate for true personal AI, often referencing “software you own”

🛠️ A DIY Personal AI Stack (2025 Edition)

This stack assumes local-first, modular, privacy-respecting tools. You can mix and match based on your needs.

🧩 Core Components

Layer	Tools / Projects	Notes
1. OS Layer	Linux, pAI-OS (in dev)	Linux gives flexibility; pAI-OS aims to be a full personal AI OS
2. Local LLM Inference	Ollama, LM Studio, PrivateGPT, LocalAI	Run models like LLaMA 3, Mistral, Phi-3 entirely offline
3. Vector Database	Chroma, Weaviate, Milvus, or SQLite with FTS5	Store and retrieve embedded text data
4. Personal Indexing	LlamaIndex, LangChain, AutoGen, GPTPilot	Feed personal documents and data to your LLM
5. Personal Data Storage	Solid POD, Datasette, Obsidian	Your life’s data, structured and queryable
6. Personal Agent Framework	Superagent, CrewAI, Langroid, AutoGen	Design agents that act on your behalf, modular and scriptable
7. Front-end Interface	Gradio, Streamlit, Tauri app, Electron	Local GUI/UX to interact with your agent
8. Voice Interface (optional)	OpenVoice, Vosk, Whisper.cpp, Coqui TTS	Build voice-in/voice-out agent on your machine

🧠 Optional Add-ons

Offline search: Chat over your files, PDFs, photos, and emails with LlamaIndex + PrivateGPT
Calendar/notes: Sync with Obsidian or local markdown-based knowledge base
Browser agent: Auto-control browser via Puppeteer/Playwright for marketplace navigation

🧭 What’s Still Missing?

True personal OS integration (like what pAI-OS is trying to do)
Seamless data portability and contract enforcement (IEEE P7012 is working on it)
Trusted personal AI app ecosystem (some apps, like Beeper or Obsidian, are getting close)

Would you like a tailored recommendation for building a personal AI stack based on your technical skill level or use cases (e.g., journaling, research, negotiation, shopping, data querying)?

4. How can I get started?

Me:

Yes. My technical skill level is modest at best. I can write HTML and do very basic stuff in the command line. I have an extra laptop I can devote to solo work, and Ethernet to a 2.1GB/s symmetrical fiber connection to the Internet with no data limits.

ChatGPT:

Great — with your setup (modest technical skills, spare laptop, blazing-fast fiber), you’re in an excellent position to begin building a practical personal AI system without diving too deep into complex dev work. Here’s a recommended stack tailored to your current skills and goals:

🧰 YOUR PERSONAL AI STARTER STACK (2025)

🔧 1. Platform Base

Use a Linux distribution designed for ease of use and AI tinkering:

Recommended: Pop!_OS or Ubuntu 24.04
Easy GUI, good hardware compatibility, lots of online support
Optional: Use your laptop in “air-gapped” mode for privacy and full local control

🤖 2. Run a Local LLM with LM Studio

LM Studio (macOS/Windows/Linux) lets you run LLaMA 3, Mistral, Phi-3, and others without coding.
Built-in UI — chat with models locally, no internet required after download.
You can download 7B or 13B models based on your hardware.
Best for: private journaling, summarizing files, answering questions.

✅ Why it’s ideal for you:

GUI-based
Just click and download models
No API keys or command-line needed once installed

📁 3. Chat with Your Files (PrivateGPT / LlamaIndex)

PrivateGPT or GPTPilot can let you ask questions across PDFs, emails, and notes stored on your machine.
Start simple: drop files into a folder, let the AI index them.
Optionally install via Docker (one command), or use their prebuilt apps.

✅ Use case:

“Summarize this research PDF”
“What did I write about project X last month?”
“Find the email where someone mentioned a refund”

📚 4. Knowledge Base + Journaling with Obsidian

Obsidian = Markdown-based knowledge base.
Everything you write stays local.
You can point LM Studio to your notes to query your own memory.

✅ Easy way to:

Keep private journals
Index your life
Ask your AI things like “What are my goals for June?”

🗂️ 5. Personal Data Management with Datasette

Datasette by Simon Willison lets you browse and query your own data with simple SQL and CSV files.
Feeds AI with structured personal info (purchases, inventory, plans, etc.)
You can install it with a single terminal command:
bash
pip install datasette
datasette serve your-data.db

✅ Combine with LM Studio to get responses like:

“What subscriptions do I need to cancel?”
“When did I last back up my photos?”
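
To make the Datasette step a little more concrete, here is a minimal sketch of how a database like `your-data.db` might be filled from a CSV using only Python's standard library. The `subscriptions` table, its columns, the `status` values, and the sample rows are all hypothetical, invented for illustration; nothing in the conversation specifies them.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; in practice this would come from a CSV file you keep.
SAMPLE_CSV = """name,monthly_cost,renews_on,status
Streaming Service A,15.99,2025-07-01,active
News Site B,8.00,2025-07-15,active
Old Software C,4.99,2025-06-30,cancel
"""


def load_subscriptions(conn, csv_text):
    """Create a subscriptions table (if needed) and load rows from CSV text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subscriptions ("
        "name TEXT, monthly_cost REAL, renews_on TEXT, status TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executemany(
        "INSERT INTO subscriptions "
        "VALUES (:name, :monthly_cost, :renews_on, :status)",
        rows,
    )
    conn.commit()
    return len(rows)


def subscriptions_to_cancel(conn):
    """Return the names of subscriptions marked with the assumed 'cancel' status."""
    rows = conn.execute(
        "SELECT name FROM subscriptions WHERE status = 'cancel' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # An in-memory database keeps this demo from touching the disk.
    demo = sqlite3.connect(":memory:")
    load_subscriptions(demo, SAMPLE_CSV)
    print(subscriptions_to_cancel(demo))
```

Connecting to `"your-data.db"` instead of `":memory:"` would write a file that `datasette serve your-data.db` could then open in the browser.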

🌐 6. Optional: Browser Helper Agent (Local, Simple)

Use AutoBrowser or a simpler Puppeteer script to automate browsing tasks (like checking prices, news, account activity).
Start with prebuilt scripts. No coding needed if you copy/paste working examples.

🔒 7. Privacy Boosters

Run Tailscale or WireGuard to access your AI remotely if needed.
Use LocalAI or Ollama if you later want to run everything without GUI (more advanced).

🧭 Putting It All Together

Task	Tool
Chat privately with an AI	LM Studio
Summarize your own notes or PDFs	PrivateGPT or GPTPilot
Store and reflect on your life data	Obsidian, Datasette
Manage subscriptions and tasks	Datasette + Chat
Ask about your files or history	LlamaIndex or GPT over Obsidian notes
Use voice later (optional)	Whisper.cpp, OpenVoice (easy tools)

🪜 Next Steps (in plain order)

Install Pop!_OS or Ubuntu on your spare laptop
Install LM Studio and download a 7B model like Mistral or LLaMA 3
Start using it to answer personal questions and summarize documents
Add Obsidian to start journaling and reflecting
Try PrivateGPT or Datasette to search your past or track your life

Would you like a checklist or an install script to get started with this stack? I can also help you set up a file structure to begin tracking and querying your own data.

5. Forms of personal data to manage

Me:

The personal data I need to manage are health records (from 20-plus providers), financial records (fewer than health), travel records (including the kind kept by apps that watch my movements), subscriptions and other recurring payments, shows I have watched and on what channels or services (data gathered by my Samsung and Roku TVs and presumably shared with others but to which I don’t yet have access), continuing work and work history with various clients and employers, contacts, archival and current calendars, real estate, personal items bought or acquired from various sources and where they are kept or stored, and business dealings. What would be the file structure for all that, and how would I manage it? Note that I am not talking about shopping yet; that’s a separate concern.

ChatGPT:

Thanks for the detailed breakdown — you’re describing the foundation of a true personal data architecture: structured, queryable, and under your control. Let’s build you a file structure and management strategy that matches your needs without overcomplicating the tech.

📁 Recommended Top-Level Folder Structure

This folder structure is designed to be future-proof, machine-readable, and human-usable. Use plain folders, CSV/JSON/Markdown files, and optional SQLite databases to store structured data. You can later layer tools (like LLMs or Datasette) on top.

markdown
/PersonalAI_Data
│
├── Health/
│   ├── Providers/
│   ├── Visits/
│   ├── Prescriptions/
│   ├── Labs_Imaging/
│   └── Insurance_Claims/
│
├── Finance/
│   ├── Bank_Accounts/
│   ├── Credit_Cards/
│   ├── Investments/
│   └── Taxes/
│
├── Travel/
│   ├── Trips/
│   ├── Locations_History/
│   └── Maps_GPX_KML/
│
├── Subscriptions/
│   ├── Streaming/
│   ├── News_Publications/
│   ├── Software_Tools/
│   └── Utilities_Bills/
│
├── Media_Viewed/
│   ├── TV_Shows/
│   ├── Movies/
│   ├── Channels_Services/
│   └── Watch_History.csv
│
├── Work/
│   ├── Employers/
│   ├── Clients/
│   ├── Projects/
│   └── Timesheets_Invoices/
│
├── Contacts/
│   └── contacts.csv
│
├── Calendar/
│   ├── Archive_ICS/
│   ├── Current_ICS/
│   └── Events_Tags.csv
│
├── Real_Estate/
│   ├── Properties_Owned/
│   ├── Rentals/
│   └── Maintenance_Records/
│
├── Personal_Items/
│   ├── Furniture/
│   ├── Electronics/
│   ├── Art_Collectibles/
│   └── Inventory.csv
│
└── Business/
    ├── Deals/
    ├── Agreements/
    ├── Partners/
    └── Notes/
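
Rather than creating all those folders by hand, a short script could do it. What follows is only a sketch, assuming Python 3 is installed on the laptop; the `create_tree` function and the `STRUCTURE` dictionary are names made up here, while the folder names are copied from the tree above. It creates directories only and leaves the CSV files (`Watch_History.csv`, `contacts.csv`, and so on) for later.

```python
from pathlib import Path

# Folder names copied from the tree above. Only directories are created;
# the CSV files shown in the tree are not.
STRUCTURE = {
    "Health": ["Providers", "Visits", "Prescriptions", "Labs_Imaging", "Insurance_Claims"],
    "Finance": ["Bank_Accounts", "Credit_Cards", "Investments", "Taxes"],
    "Travel": ["Trips", "Locations_History", "Maps_GPX_KML"],
    "Subscriptions": ["Streaming", "News_Publications", "Software_Tools", "Utilities_Bills"],
    "Media_Viewed": ["TV_Shows", "Movies", "Channels_Services"],
    "Work": ["Employers", "Clients", "Projects", "Timesheets_Invoices"],
    "Contacts": [],
    "Calendar": ["Archive_ICS", "Current_ICS"],
    "Real_Estate": ["Properties_Owned", "Rentals", "Maintenance_Records"],
    "Personal_Items": ["Furniture", "Electronics", "Art_Collectibles"],
    "Business": ["Deals", "Agreements", "Partners", "Notes"],
}


def create_tree(root):
    """Create every top-level folder and subfolder under root.

    Safe to run more than once: existing folders are left alone.
    Returns the list of folder paths that were ensured to exist.
    """
    root = Path(root)
    ensured = []
    for top, subfolders in STRUCTURE.items():
        for folder in [root / top] + [root / top / sub for sub in subfolders]:
            folder.mkdir(parents=True, exist_ok=True)
            ensured.append(folder)
    return ensured
```

Calling `create_tree(Path.home() / "PersonalAI_Data")` would put the whole structure in the home directory; any other starting folder works the same way.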

🧠 Strategy for Managing It All

1. Use Consistent File Types

Data Type	File Format
Structured entries	`.csv`, `.json`, `.sqlite`
Notes or logs	`.md` (markdown)
Documents	`.pdf`, `.docx`, `.txt`
Images/attachments	store in `attachments/` subfolder per topic

2. Naming Conventions (Machine + Human Friendly)

Use ISO-like formats to make things sortable and parseable:
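
One way such a convention could look, as a sketch only: the `YYYY-MM-DD_topic_description.ext` pattern and the `make_filename` helper below are assumptions made for illustration, not something the conversation specifies. The point is that an ISO 8601 date at the front makes an alphabetical sort also a chronological one.

```python
import re
from datetime import date


def slugify(text):
    """Lowercase the text and collapse anything that isn't a-z or 0-9 into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_filename(day, topic, description, ext):
    """Build a name like 2025-06-18_health_lab-results.pdf (hypothetical pattern)."""
    return f"{day.isoformat()}_{slugify(topic)}_{slugify(description)}.{ext.lstrip('.')}"


if __name__ == "__main__":
    name = make_filename(date(2025, 6, 18), "Health", "Lab Results: Dr. Chen", "pdf")
    print(name)  # 2025-06-18_health_lab-results-dr-chen.pdf
```

Because the date leads and is zero-padded, a plain `sorted()` over such names (or any file manager's sort by name) lists them oldest first.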

3. Data Capture Tools

You don’t need to build it all manually:

Use Obsidian for ongoing notes (e.g. travel journal, work logs)
Use Paperless-ngx (optional) to OCR and tag your scanned PDFs (health, finance)
Use GPS History Export from Google and Apple for travel/locations
Use email filters or Zapier to auto-download bills, statements, confirmations

4. Querying and Reviewing

Add one or more of these later:

Datasette → browse and search CSVs/SQLite locally
LM Studio + PrivateGPT → ask, “When did I last visit Dr. Chen?” or “What devices are in storage?”
LlamaIndex → index your file tree and let AI search/summarize it
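
Before any of those tools are layered on, the basic idea of searching your own file tree can be tried with a plain keyword search. The sketch below is not Datasette, PrivateGPT, or LlamaIndex; it is a simple stand-in using only Python's standard library, and the `search_notes` name and the list of file suffixes it scans are assumptions made for illustration.

```python
from pathlib import Path


def search_notes(root, term, suffixes=(".md", ".txt", ".csv")):
    """Return (relative path, line number, line) for each line containing term.

    Matching is case-insensitive. Only plain-text files with the given
    suffixes are read; PDFs and other binary files are skipped.
    """
    term_lower = term.lower()
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if term_lower in line.lower():
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Something like `search_notes("PersonalAI_Data", "Dr. Chen")` would list every note or CSV row that mentions the name. It won't answer "when did I last visit?" on its own, but it shows what the indexing tools have to work with.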

✅ Summary of Benefits

Goal	This Setup Delivers
Total local control	You decide access, storage, retention
Future AI-readiness	Structured enough for agents to use
Easy to back up	Plain files, portable databases
Modular growth	Start simple, layer in AI when ready

Would you like me to generate a starter template with folder structure and some CSV/Markdown file formats filled in with mock data for each domain? That way you can load it directly into your system and start customizing.

Me again:

That’s it for now.

What should my next request be, or what other direction would you like me (or us) to take this?

Or, what are the learnings we have so far in this field that barely exists?

Oh, and excuse whatever inaccuracies ChatGPT may have floated above. Or give me fixes.

Thanks.

*The links are mine and based on searches for those names. I might have some of them wrong, however, since searches for those names bring up many scholars with the same names. And, while each name had a link, they all went to author archive pages that (with one exception) listed no other contributions to the publication.

The Personal AI Greenfield

June 11, 2024 / Doc Searls / 11 Comments

What forms of pAI—personal AI—are Apple, Mozilla, Google, Meta, Microsoft and the rest not doing?

Let’s look at those first two because they’re at the top of the news LIFO buffer.

Apple Intelligence (“coming in beta this fall*”), announced yesterday, will help you with writing and creating images while giving you less lame answers from Siri. (Which they should re-name. Siri is Apple’s Clippy.) It “can draw on larger server-based models, running on Apple silicon, to handle more complex requests for you while protecting your privacy.” The “larger models” will be white-labeled ChatGPT, plus Apple’s own small language models (SLMs).

Mozilla, which got $400+ million a year from Google (for search in the Firefox browser) starting in 2020, announced on June 3 that they will be Building open, private AI with the Mozilla Builders Accelerator. Jive:

This program is designed to empower independent AI and machine learning engineers with the resources and support they need to thrive. It aims to cultivate a more innovative AI ecosystem, and it’s one of Mozilla’s key initiatives to make AI meaningfully impactful — alongside efforts like Mozilla.ai, the Responsible AI Challenge and the Rise25 Awards.

The Mozilla Builders Accelerator’s inaugural theme is local AI, which involves running AI models and applications directly on personal devices like laptops, smartphones, or edge devices rather than depending on cloud-based services…

We chose Local AI as the theme for the Accelerator’s first cohort because it aligns with our core values of privacy, user empowerment, and open source innovation. This method offers several benefits including:

Privacy: Data stays on the local device, minimizing exposure to potential breaches and misuse.

Agency: Users have greater control over their AI tools and data.

Cost-effectiveness: Reduces reliance on expensive cloud infrastructure, lowering costs for developers and users.

Reliability: Local processing ensures continuous operation even without internet connectivity.

Looks to me like both of these are Big AI writ small. It’s “local,” not personal. It’s made to serve your needs with what BigAI offers through APIs. It is still essentially AIaaS (AI as a Service), rather than truly personal AI (pAI): personalized more than personal.

That’s also what I see when I read between the lines at Mozilla’s AI job openings. Take platform engineer. This person will (among other things), “assist in managing and orchestrating workloads across multiple cloud providers.” That’s fine. I’m sure true pAIs will do that too. But most of pAI will be more personal than that. It will deal with the mundanities of your everyday life. Not with coughing up answers that can only come from AIaaSes.

The problem with personalizing AI giant offerings is that they are large language models (LLMs) trained on everything that can be crawled on the Internet, plus who knows what else. Not on your truly personal stuff. This is why “prompt engineering” worthy of the noun is not for anybody:

Prompt engineering is crucial for deploying LLMs but is poorly understood mathematically. We formalize LLM systems as a class of discrete stochastic dynamical systems to explore prompt engineering through the lens of control theory. We investigate the reachable set of output token sequences $R_y(\mathbf x_0)$ for which there exists a control input sequence $\mathbf u$ for each $\mathbf y \in R_y(\mathbf x_0)$ that steers the LLM to output $\mathbf y$ from initial state sequence $\mathbf x_0$. We offer analytic analysis on the limitations on the controllability of self-attention in terms of reachable set, where we prove an upper bound on the reachable set of outputs $R_y(\mathbf x_0)$ as a function of the singular values of the parameter matrices. We present complementary empirical analysis on the controllability of a panel of LLMs, including Falcon-7b, Llama-7b, and Falcon-40b. Our results demonstrate a lower bound on the reachable set of outputs $R_y(\mathbf x_0)$ w.r.t. initial state sequences $\mathbf x_0$ sampled from the Wikitext dataset. We find that the correct next Wikitext token following sequence $\mathbf x_0$ is reachable over 97% of the time with prompts of $k\leq 10$ tokens. We also establish that the top 75 most likely next tokens, as estimated by the LLM itself, are reachable at least 85% of the time with prompts of $k\leq 10$ tokens. Intriguingly, short prompt sequences can dramatically alter the likelihood of specific outputs, even making the least likely tokens become the most likely ones. This control-centric analysis of LLMs demonstrates the significant and poorly understood role of input sequences in steering output probabilities, offering a foundational perspective for enhancing language model system capabilities.

But all that stuff applies mostly when we’re prompting a big LLM system.

What about using AI in our own lives, where the data that matters most are in our calendars, contacts, financial and health records, our travels, our correspondence (email, chat, whatever)? And how about all the location data we might get from our cars, phone apps, and phone companies? These should be much easier for a pAI to gather, examine, and help us do useful things. Caring about much less data also means a pAI will be less likely to give wrong (hallucinated) answers.

Today the mental frame almost everybody uses for AI is the Big kind, ingesting everything they can get their crawlers on, and munching all of it in giant compute farms. Those systems are great for lots of stuff, but they still don’t deal with the personal data listed in the last paragraph.

Not yet, anyway.

Look at it this way. For each of us, there are three data pools:

The entire Net, which is what gets crawled by all the giant LLM operators, plus whatever else they can get their claws on.
One’s personal life, some of which is digitized in useful form (contacts, calendar, mail, stuff in folders inside PCs and attached drives).
Personal data that is in the hands of giants, but is rightfully ours. These include our driving record and driving practices (recorded by our late-model cars and snitched to insurance companies and others), our location data (kept by car makers and phone carriers, and shared with the likes of Google and the feds), and our TV viewing habits (gathered by Google, Amazon, Roku, Apple, etc.).

The pAI greenfield is with the last two.

Tell us who is working on what there, preferably with open source, and not sitting on walled garden silicon.

[Later… ] Since readers told me I had small language models (SLMs) wrong in one of the paragraphs above, and I’m not sure I had them right, I rewrote them out of the piece. I invite readers to post comments to further correct and expand on the subject of pAIs and what they can do.

Coming soon to a radio near you: Personalized ads

September 25, 2023 / Doc Searls / 2 Comments

And privacy be damned.

See, there is an iron law for every new technology: What can be done will be done. And a corollary that says: until it’s clear what shouldn’t be done. Let’s call those Stage One and Stage Two.

With respect to safety from surveillance in our cars, we’re at Stage One.

For Exhibit A, read what Ray Schultz says in Can Radio Time Be Bought With Real-Time Bidding? iHeartMedia is Working On It:

iHeartMedia hopes to offer real-time bidding for its 860+ radio stations in 160 markets, enabling media buyers to buy audio ads the way they now buy digital.

“We’re going to have the capabilities to do real-time bidding and programmatic on the broadcast side,” said Rich Bressler, president and COO of iHeart Media, during the Goldman Sachs Communacopia + Technology Conference, according to Radio Insider.

Bressler did not offer specifics or a timeline. He added: “If you look at broadcasters in general, whether they’re video or audio, I don’t think anyone else is going to have those capabilities out there.”

“The ability, whenever it comes, would include data-infused buying, programmatic trading and attribution,” the report adds.

The Trade Desk lists iHeart Media as one of its programmatic audio partners.

Audio advertising allows users to integrate their brands into their audiences’ “everyday routines in a distraction-free environment, creating a uniquely personalized ad experience around their interests,” the Trade Desk says.

The Trade Desk “specializes in real-time programmatic marketing automation technologies, products, and services, designed to personalize digital content delivery to users.” Translation: “We’re in the surveillance business.”

Never mind that there is negative demand for surveillance by the surveilled. Push-back has been going on for decades. Here are 154 pieces I’ve written on the topic since 2008.

One might think radio is ill-suited for surveillance because it’s an offline medium. People listen more to actual radios than to computers or phones. Yes, some listening is online; but not much, relatively speaking. For example, here is the bottom of the current radio ratings for the San Francisco market:

Those numbers are fractions of one percent of total listening in the country’s most streaming-oriented market.

So how are iHeart and The Trade Desk going to personalize radio ads? Well, here is a meaningful excerpt from iHeart To Offer Real-Time Bidding For Its Broadcast Ad Inventory, which ran earlier this month at Inside Radio:

The biggest challenge at iHeartMedia isn’t attracting new listeners, it’s doing a better job monetizing the sprawling audience it already has. As part of ongoing efforts to sell advertising the way marketers want to transact, it now plans to bring real-time bidding to its 850 broadcast radio stations, top company management said Thursday.

“We’re going to have the capabilities to do real-time bidding and programmatic on the broadcast side,” President and COO Rich Bressler said during an appearance at the Goldman Sachs Communacopia + Technology Conference. “If you look at broadcasters in general, whether they’re video or audio, I don’t think anyone else is going to have those capabilities out there.”

Real-time bidding is a subcategory of programmatic media buying in which ads are bought and sold in real time on a per-impression basis in an instant auction. Pittman and Bressler didn’t offer specifics on how this would be accomplished other than to say the company is currently building out the technology as part of a multi-year effort to allow advertisers to buy iHeart inventory the way they buy digital media advertising. That involves data-infused buying and programmatic trading, along with ad targeting and campaign attribution.

Radio’s largest group has also moved away from selling based on rating points to transacting on audience impressions, and migrated from traditional demographics to audiences or cohorts. It now offers advertisers 800 different prepopulated audience segments, ranging from auto intenders to moms that had a baby in the last six months…

Advertisers buy iHeart’s ad inventory “in pieces,” Pittman explained, leaving “holes in between” that go unsold. “Digital-like buying for broadcast radio is the key to filling in those holes,” he added…

…there has been no degradation in the reach of broadcast radio. The degradation has been in a lot of other media, but not radio. And the reason is because what we do is fundamentally more important than it’s ever been: we keep people company.”

Buried in that rah-rah is a plan to spy on people in their cars. Because surveillance systems are built into every new car sold. In ‘Privacy Nightmare on Wheels’: Every Car Brand Reviewed By Mozilla — Including Ford, Volkswagen and Toyota — Flunks Privacy Test, Mozilla pulls together a mountain of findings about just how much modern cars spy on their drivers and passengers, and then pass personal information on to many other parties. Here is one relevant screen grab:

As for consent? When you’re using a browser or an app, you’re on the global Internet, where the GDPR, the CCPA, and other privacy laws apply, meaning that websites and apps have to make a show of requiring consent to what you don’t want. But cars have no UI for that. All their computing is behind the dashboard where you can’t see it and can’t control it. So the car makers can go nuts gathering fuck-all, while you’re almost completely in the dark about having your clueless ass sorted into one or more of Bob Pittman’s 800 target categories. Or worse, typified personally as a category of one.

Of course, the car makers won’t cop to any of this. On the contrary, they’ll pretend they are clean as can be. Here is how Mozilla describes the situation:

Many car brands engage in “privacy washing.” Privacy washing is the act of pretending to protect consumers’ privacy while not actually doing so — and many brands are guilty of this. For example, several have signed on to the automotive Consumer Privacy Protection Principles. But these principles are nonbinding and created by the automakers themselves. Further, signatories don’t even follow their own principles, like Data Minimization (i.e. collecting only the data that is needed).

Meaningful consent is nonexistent. Often, “consent” to collect personal data is presumed by simply being a passenger in the car. For example, Subaru states that by being a passenger, you are considered a user — and by being a user, you have consented to their privacy policy. Several car brands also note that it is a driver’s responsibility to tell passengers about the vehicle’s privacy policies.

Autos’ privacy policies and processes are especially bad. Legible privacy policies are uncommon, but they’re exceptionally rare in the automotive industry. Brands like Audi and Tesla feature policies that are confusing, lengthy, and vague. Some brands have more than five different privacy policy documents, an unreasonable number for consumers to engage with; Toyota has 12. Meanwhile, it’s difficult to find a contact with whom to discuss privacy concerns. Indeed, 12 companies representing 20 car brands didn’t even respond to emails from Mozilla researchers.

And, “Nineteen (76%) of the car companies we looked at say they can sell your personal data.”

To iHeart? Why not? They’re in the market.

And, of course, you are not.

Hell, you have access to none of that data. There’s what the dashboard tells you, and that’s it.

As for advice? For now, all I have is this: buy an old car.

Thinking outside the browser

March 21, 2021 / Doc Searls / 6 Comments

Even if you’re on a phone, chances are you’re reading this in a browser.

Chances are also that most of what you do online is through a browser.

Hell, many—maybe even most—of the apps you use on your phone use the WebKit browser engine. Meaning they’re browsers too.

And, of course, I’m writing this in a browser.

Which, alas, is subordinate by design. That’s because, while the Internet at its base is a world-wide collection of peers, the Web that runs on it is a collection of servers to which we are mere clients. The model is an old mainframe one called client-server. This is actually more of a calf-cow arrangement than a peer-to-peer one:

The reason we don’t feel like cattle is that the base functions of a browser work fine, and misdirect us away from the actual subordination of personal agency and autonomy that’s also taking place.

See, the Web invented by Tim Berners-Lee was just a way for one person to look at another’s documents over the Internet. And that it still is. When you “go to” or “visit” a website, you don’t go anywhere. Instead, you request a file. Even when you’re watching or listening to an audio or video stream, what actually happens is that a file unfurls itself into your browser.

What you typically expect when you go to a website is the file called a page. You also expect that page will bring a payload of other files: ones providing graphics, video clips, or whatever. You might also expect the site to remember that you’ve been there before, or that you’re a subscriber to the site’s services.

You may also understand that the site remembers you because your browser carries a “cookie” the site put there, to help the site remember what’s called “state,” so the browser and the site can renew their acquaintance with every visit. It is for this simple purpose that Lou Montulli invented the cookie in the first place, back in 1994. Lou got that idea because the client-server model puts the most agency on the server’s side, and in the dial-up world of the time, that made the most sense.
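To make “state” concrete, here is a minimal sketch in Python of the round trip a cookie makes. It is an illustration only, not how any particular site works: the function `server_response`, the cookie name `session`, and the identifiers `s1`, `s2` are all made up for this example. The only real library call is Python’s standard `http.cookies.SimpleCookie`.

```python
from http.cookies import SimpleCookie


def server_response(cookie_header, visits_by_session):
    """Hypothetical server: read the Cookie header, return (body, Set-Cookie value or None)."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if "session" in jar:
        # The browser sent back an identifier, so the server can look up what it remembers.
        session_id = jar["session"].value
        visits_by_session[session_id] = visits_by_session.get(session_id, 0) + 1
        return f"Welcome back (visit {visits_by_session[session_id]})", None
    # No cookie yet: the server mints an identifier and asks the browser to store it.
    session_id = f"s{len(visits_by_session) + 1}"
    visits_by_session[session_id] = 1
    out = SimpleCookie()
    out["session"] = session_id
    return "Hello, stranger", out["session"].OutputString()


# The memory lives on the server side; the browser only carries the identifier.
server_memory = {}
first_body, set_cookie = server_response("", server_memory)
second_body, _ = server_response(set_cookie, server_memory)
print(first_body)   # Hello, stranger
print(set_cookie)   # session=s1
print(second_body)  # Welcome back (visit 2)
```

Note where the agency sits in this sketch: the server decides what the identifier is and what it means, while the browser simply stores it and hands it back.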

Alas, even though we now live in a world where there can be boundless intelligence on the individual’s side, and there is far more capacious communication bandwidth between network nodes, damn near everyone continues to presume a near-absolute power asymmetry between clients and servers, calves and cows, people and sites. It’s also why today when you go to a site and it asks you to accept its use of cookies, something unknown to you (presumably—you can’t tell) remembers that “agreement” and its settings, and you don’t—even though there is no reason why you shouldn’t or couldn’t. It doesn’t even occur to the inventors and maintainers of cookie acceptance systems that a mere “user” should have a way to record, revisit or audit the “agreement.” All they want is what the law now requires of them: your “consent.”

This near-absolute power asymmetry between the Web’s calves and cows is also why you typically get a vast payload of spyware when your browser simply asks to see whatever it is you actually want from the website. To see how big that payload can be, I highly recommend a tool called PageXray, from Fou Analytics, run by Dr. Augustine Fou (aka @acfou). For a test run, try PageXray on the Daily Mail’s U.S. home page, and you’ll see that you’re also getting this huge payload of stuff you didn’t ask for:

Adserver Requests: 756
Tracking Requests: 492
Other Requests: 184

The visualization looks like this:

This is how, as Richard Whitt perfectly puts it, “the browser is actually browsing us.”

All those requests, most of which are for personal data of some kind, come in the form of cookies and similar files. The visual above shows how information about you spreads out to a nearly countless number of third parties and dependents on those. And, while these cookies are stored by your browser, they are meant to be readable only by the server or one or more of its third parties.
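For a rough feel of what such a tally involves, here is a small Python sketch that sorts the requests a page triggers into first-party and third-party by hostname. To be clear, this is not PageXray’s method, which I don’t know. The function name `tally_requests`, the hostname-suffix rule, and the example URLs are all assumptions for illustration, and real tools need much more careful rules for deciding what counts as the same site.

```python
from collections import Counter
from urllib.parse import urlsplit


def tally_requests(page_url, request_urls):
    """Count requests as first-party or third-party using a naive hostname comparison."""
    page_host = urlsplit(page_url).hostname or ""
    counts = Counter()
    for url in request_urls:
        host = urlsplit(url).hostname or ""
        # Naive assumption: the page's own host and its subdomains are "first-party".
        same_site = host == page_host or host.endswith("." + page_host)
        counts["first-party" if same_site else "third-party"] += 1
    return counts


# Made-up example: one page, three requests it triggers.
example = tally_requests(
    "https://news.example/",
    [
        "https://news.example/index.html",
        "https://img.news.example/photo.png",
        "https://ads.tracker.test/pixel.gif",
    ],
)
print(example["first-party"], example["third-party"])  # 2 1
```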

This is the icky heart of the e-commerce “ecosystem” today.

By the way, and to be fair, two of the browsers in the graphic above—Epic and Tor—by default disclose as little as possible about you and your equipment to the sites you visit. Others have privacy features and settings. But getting past the whole calf-cow system is the real problem we need to solve.

Cross-posted at the Customer Commons blog, here.

Let’s zero-base zero-party data

December 9, 2020 / Doc Searls / 1 Comment

Forrester Research has gifted marketing with a hot buzzphrase: zero-party data, which they define as “data that a customer intentionally and proactively shares with a brand, which can include preference center data, purchase intentions, personal context, and how the individual wants the brand to recognize her.”

Salesforce, the CRM giant (that’s now famously buying Slack), is ambitious about the topic, and how it can “fuel your personalized marketing efforts.” The second-person “you” there is Salesforce’s corporate customer.

It’s important to unpack what Salesforce says about that fuel, because Salesforce is a tech giant that fully matters. So here’s text from that last link. I’ll respond to it in chunks. (Note that zero, first and third party data is about you, no matter who it’s from.)

What is zero-party data?

Before we define zero-party data, let’s back up a little and look at some of the other types of data that drive personalized experiences.

First-party data: In the context of personalization, we’re often talking about first-party behavioral data, which encompasses an individual’s site-wide, app-wide, and on-page behaviors. This also includes the person’s clicks and in-depth behavior (such as hovering, scrolling, and active time spent), session context, and how that person engages with personalized experiences. With first-party data, you glean valuable indicators into an individual’s interests and intent. Transactional data, such as purchases and downloads, is considered first-party data, too.

Third-party data: Obtained or purchased from sites and sources that aren’t your own, third-party data used in personalization typically includes demographic information, firmographic data, buying signals (e.g., in the market for a new home or new software), and additional information from CRM, POS, and call center systems.

Zero-party data, a term coined by Forrester Research, is also referred to as explicit data.

They then go on to quote Forrester’s definition, substituting “[them]” for “her.”

The first party in that definition is the site harvesting “behavioral” data about the individual. (It doesn’t square with the legal profession’s understanding of the term, so if you know that one, try not to be confused.)

It continues,

Why is zero-party data important?

Forrester’s Fatemeh Khatibloo, VP principal analyst, notes in a video interview with Wayin (now Cheetah Digital) that zero-party data “is gold. … When a customer trusts a brand enough to provide this really meaningful data, it means that the brand doesn’t have to go off and infer what the customer wants or what [their] intentions are.”

Sure. But what if the customer has her own way to be a precious commodity to a brand—one she can use at scale with all the brands she deals with? I’ll unpack that question shortly.

There’s the privacy factor to keep in mind too, another reason why zero-party data – in enabling and encouraging individuals to willingly provide information and validate their intent – is becoming a more important part of the personalization data mix.

Two things here.

First, again, individuals need their own ways to protect their privacy and project their intentions about it.

Second, having as many ways for brands to “enable and encourage” disclosure of private information as there are brands to provide them is hugely inefficient and annoying. But that is what Salesforce is selling here.

As industry regulations such as GDPR and the CCPA put a heightened focus on safeguarding consumer privacy, and as more browsers move to phase out third-party cookies and allow users to easily opt out of being tracked, marketers are placing a greater premium and reliance on data that their audiences knowingly and voluntarily give them.

Not if the way they “knowingly and voluntarily” agree to be tracked is by clicking “AGREE” on website home page popovers. Those only give those sites ways to adhere to the letter of the GDPR and the CCPA while also violating those laws’ spirit.

Experts also agree that zero-party data is more definitive and trustworthy than other forms of data since it’s coming straight from the source. And while that’s not to say all people self-report accurately (web forms often show a large number of visitors are accountants, by profession, which is the first field in the drop-down menu), zero-party data is still considered a very timely and reliable basis for personalization.

Self-reporting will be a lot more accurate if people have real relationships with brands, rather (again) than ones that are “enabled and encouraged” in each brand’s own separate way.

Here is a framework by which that can be done. Phil Windley provides some cool detail for operationalizing the whole thing here, here, here and here.

Even if the countless separate ways are provided by one company (e.g. Salesforce), every brand will use those ways differently, giving each brand scale across many customers, but giving those customers no scale across many companies. If we want that kind of scale, dig into the links in the paragraph above.

With great data comes great responsibility.

You’re not getting something for nothing with zero-party data. When customers and prospects give and entrust you with their data, you need to provide value right away in return. This could take the form of: “We’d love you to take this quick survey, so we can serve you with the right products and offers.”

But don’t let the data fall into the void. If you don’t listen and respond, it can be detrimental to your cause. It’s important to honor the implied promise to follow up. As a basic example, if you ask a site visitor: “Which color do you prefer – red or blue?” and they choose red, you don’t want to then say, “Ok, here’s a blue website.” Today, two weeks from now, and until they tell or show you differently, the website’s color scheme should be red for that person.

While this example is simplistic, the concept can be applied to personalizing content, product recommendations, and other aspects of digital experiences to map to individuals’ stated preferences.

This, and what follows in that Salesforce post, is a pitch for brands to play nice and use surveys and stuff like that to coax private information out of customers. It’s nice as far as it can go, but it gives no agency to customers—you and me—beyond what we can do inside each company’s CRM silo.

So here are some questions that might be helpful:

What if the customer shows up as somebody who already likes red and is ready to say so to trusted brands? Or, better yet, if the customer arrives with a verifiable claim that she is already a customer, or that she has good credit, or that she is ready to buy something?
What if she has her own way of expressing loyalty, and that way is far more genuine, interesting and valuable to the brand than the company’s current loyalty system, which is full of gimmicks, forms of coercion, and operational overhead?
What if the customer carries her own privacy policy and terms of engagement (ones that actually protect the privacy of both the customer and the brand, if the brand agrees to them)?

All those scenarios yield highly valuable zero-party data. Better yet, they yield real relationships with values far above zero.

Those questions suggest just a few of the places we can go if we zero-base customer relationships outside standing CRM systems: out in the open market where customers want to be free, independent, and able to deal with many brands with tools and services of their own, through their own CRM-friendly VRM—Vendor Relationship Management—tools.

VRM reaching out to CRM implies (and will create) a much larger middle market space than the closed and private markets isolated inside every brand’s separate CRM system.

We’re working toward that. See here.

The Wurst of the Web

March 23, 2019 / Doc Searls / 1 Comment

Don’t think about what’s wrong on the Web. Think about what pays for it. Better yet, look at it.

Start by installing Privacy Badger in your browser. Then look at what it tells you about every site you visit. With very few exceptions (e.g. Internet Archive and Wikipedia), all are putting tracking beacons (the wurst cookie flavor) in your browser. These then announce your presence to many third parties, mostly unknown and all unseen, at nearly every subsequent site you visit, so you can be followed and profiled and advertised at. And your profile might be used for purposes other than advertising. There’s no way to tell.

This practice—tracking people without their invitation or knowledge—is at the dark heart and sold soul of what Shoshana Zuboff calls Surveillance Capitalism and Brett Frischmann and Evan Selinger call Re-engineering Humanity. (The italicized links go to books on the topic, both of which came out in the last year. Buy them.)

While that system’s business is innocuously and misleadingly called advertising, the surveilling part of it is called adtech. The most direct ancestor of adtech is not old fashioned brand advertising. It’s direct marketing, best known as junk mail. (I explain the difference in Separating Advertising’s Wheat and Chaff.)

In the online world, brand advertising and adtech look the same, but underneath they are as different as bread and dirt. While brand advertising is aimed at broad populations and sponsors media it considers worthwhile, adtech does neither. Like junk mail, adtech wants to be personal, wants a direct response, and ignores massive negative externalities. It also uses media to mark, track and advertise at eyeballs, wherever those eyeballs might show up. (This is how, for example, a Wall Street Journal reader’s eyeballs get shot with an ad for, say, Warby Parker, on Breitbart.) So adtech follows people, profiles them, and adjusts its offerings to maximize engagement, meaning getting a click. It also works constantly to put better crosshairs on the brains of its human targets; and it does this for both advertisers and other entities interested in influencing people. (For example, to swing an election.)

For most reporters covering this, the main objects of interest are the two biggest advertising intermediaries in the world: Facebook and Google. That’s understandable, but they’re just the tip of the wurstberg. Also, in the case of Facebook, it’s quite possible that it can’t fix itself. See here:

How easy do you think it is for Facebook to change: to respond positively to market and regulatory pressures?

Consider this possibility: it can’t.

One reason is structural. Facebook is comprised of many data centers, each the size of a Walmart or few, scattered around the world and costing many $billions to build and maintain. Those data centers maintain a vast and closed habitat where more than two billion human beings share all kinds of revealing personal shit about themselves and each other, while providing countless ways for anybody on Earth, at any budget level, to micro-target ads at highly characterized human targets, using up to millions of different combinations of targeting characteristics (including ones provided by parties outside Facebook, such as Cambridge Analytica, which have deep psychological profiles of millions of Facebook members). Hey, what could go wrong?

In three words, the whole thing.

The other reason is operational. We can see that in how Facebook has handed fixing what’s wrong with it over to thousands of human beings, all hired to do what The Wall Street Journal calls “The Worst Job in Technology: Staring at Human Depravity to Keep It Off Facebook.” Note that this is not the job of robots, AI, ML or any of the other forms of computing magic you’d like to think Facebook would be good at. Alas, even Facebook is still a long way from teaching machines to know what’s unconscionable. And can’t in the long run, because machines don’t have a conscience, much less an able one.

You know Goethe’s (or hell, Disney’s) story of The Sorcerer’s Apprentice? Look it up. It’ll help. Because Mark Zuckerberg is both the sorcerer and the apprentice in the Facebook version of the story. Worse, Zuck doesn’t have the mastery level of either one.

Nobody, not even Zuck, has enough power to control the evil spirits released by giant machines designed to violate personal privacy, produce echo chambers beyond counting and amplify tribal prejudices (including genocidal ones)—besides whatever good it does for users and advertisers.

The hard work here is solving the problems that corrupted Facebook so thoroughly, and are doing the same to all the media that depend on surveillance capitalism to re-engineer us all.

Meanwhile, because lawmaking is moving apace in any case, we should also come up with model laws and regulations that insist on respect for private spaces online. The browser is a private space, so let’s start there.

Here’s one constructive suggestion: get the browser makers to meet next month at IIW, an unconference that convenes twice a year at the Computer History Museum in Silicon Valley, and work this out.

Ann Cavoukian (@AnnCavoukian) got things going on the organizational side with Privacy By Design, which is now also embodied in the GDPR. She has also made clear that the same principles should apply on the individual’s side. So let’s call the challenge there Privacy By Default. And let’s have it work the same in all browsers.

I think it’s really pretty simple: the default is no. If we want to be tracked for targeted advertising or other marketing purposes, we should have ways to opt into that. But not some modification of the ways we have now, where every @#$%& website has its own methods, policies and terms, none of which we can track or audit. That is broken beyond repair and needs to be pushed off a cliff.

Among the capabilities we need on our side are 1) knowing what we have opted into, and 2) ways to audit what is done with information we have given to organizations, or has been gleaned about us in the course of our actions in the digital world. Until we have ways of doing both, we need to zero-base the way targeted advertising and marketing is done in the digital world. Because spying on people without an invitation or a court order is just as wrong in the digital world as it is in the natural one. And you don’t need spying to target.
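Here is one way those two capabilities might look from the individual’s side, as a minimal Python sketch. Nothing here is an existing tool or standard: the names `ConsentRecord`, `ConsentLedger`, `opt_in`, `allows`, and `audit` are hypothetical, and a real version would need signed, shareable records rather than an in-memory list. The point is only that the default is no, and that the person keeps the record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConsentRecord:
    """One thing the person has explicitly opted into (hypothetical structure)."""
    site: str
    purpose: str
    granted_at: datetime


@dataclass
class ConsentLedger:
    """The person's own record of what they have agreed to, and with whom."""
    records: list = field(default_factory=list)

    def opt_in(self, site, purpose, when=None):
        record = ConsentRecord(site, purpose, when or datetime.now(timezone.utc))
        self.records.append(record)
        return record

    def allows(self, site, purpose):
        # The default is no: only an explicit, recorded opt-in yields True.
        return any(r.site == site and r.purpose == purpose for r in self.records)

    def audit(self, site=None):
        # Capability 1: know what we have opted into, overall or for one site.
        return [r for r in self.records if site is None or r.site == site]


ledger = ConsentLedger()
print(ledger.allows("shop.example", "targeted-ads"))  # False
ledger.opt_in("shop.example", "targeted-ads")
print(ledger.allows("shop.example", "targeted-ads"))  # True
print(len(ledger.audit("shop.example")))              # 1
```

This sketch covers only the first capability. Auditing what an organization actually does with the data would require cooperation from the other side, which is the part that has to be worked out among browser makers and sites.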

And don’t worry about lost business. There are many larger markets to be made on the other side of that line in the sand than we have right now in a world where more than 2 billion people block ads, and among the reasons they give are “Ads might compromise my online privacy,” and “Stop ads being personalized.”

Those markets will be larger because incentives will be aligned around customer agency. And they’ll want a lot more from the market’s supply side than surveillance based sausage, looking for clicks.

Weighings

September 18, 2018 / Doc Searls / 1 Comment

A few years ago I got a Withings bathroom scale: one that knows it’s me, records my weight, body mass index and fat percentage on a graph informed over wi-fi. The graph was in a Withings cloud.

I got it because I liked the product (still do, even though it now just tells me my weight and BMI), and because I trusted Withings, a French company subject to French privacy law, meaning it would store my data in a safe place accessible only to me, and not look inside. Or so I thought.

Here’s the privacy policy, and here are the terms of use, both retrieved from Archive.org. (Same goes for the link in the last paragraph and the image above.)

Then, in 2016, the company was acquired by Nokia and morphed into Nokia Health. Sometime after that, I started to get these:

This told me Nokia Health was watching my weight, which I didn’t like or appreciate. But I wasn’t surprised, since Withings’ original privacy policy featured the lack of assurance long customary to one-sided contracts of adhesion that have been pro forma on the Web since commercial activity exploded there in 1995: “The Service Provider reserves the right to modify all or part of the Service’s Privacy Rules without notice. Use of the Service by the User constitutes full and complete acceptance of any changes made to these Privacy Rules.” (The exact same language appears in the original terms of use.)

Still, I was too busy with other stuff to care more about it until I got this from community@email.health.nokia two days ago:

Here’s the announcement at the “learn more” link. Sounded encouraging.

So I dug a bit and saw that Nokia in May planned to sell its Health division to Withings co-founder Éric Carreel (@ecaeca).

Thinking that perhaps Withings would welcome some feedback from a customer, I wrote this in a customer service form:

One big reason I bought my Withings scale was to monitor my own weight, by myself. As I recall the promise from Withings was that my data would remain known only to me (though Withings would store it). Since then I have received many robotic emailings telling me my weight and offering encouragements. This annoys me, and I would like my data to be exclusively my own again — and for that to be among Withings’ enticements to buy the company’s products. Thank you.

Here’s the response I got back, by email:

Hi,

Thank you for contacting Nokia Customer Support about monitoring your own weight. I’ll be glad to help.

Following your request to remove your email address from our mailing lists, and in accordance with data privacy laws, we have created an interface which allows our customers to manage their email preferences and easily opt-out from receiving emails from us. To access this interface, please follow the link below:

Obviously, the person there didn’t understand what I said.

So I’m saying it here. And on Twitter.

What I’m hoping isn’t for Withings to make a minor correction for one customer, but rather that Éric & Withings enter a dialog with the @VRM community and @CustomerCommons about a different approach to #GDPR compliance: one at the end of which Withings might pioneer agreeing to customers’ friendly terms and conditions, such as those starting to appear at Customer Commons.

Why personal agency matters more than personal data

June 23, 2018 / Doc Searls / 12 Comments

Lately a lot of thought, work and advocacy has been going into valuing personal data as a fungible commodity: one that can be made scarce, bought, sold, traded and so on. While there are good reasons to challenge whether or not data can be property (see Jefferson and Renieris), I want to focus on a different problem, the one best to solve first: the need for personal agency in the online world.

I see two reasons why personal agency matters more than personal data.

The first reason we have far too little agency in the networked world is that we settled, way back in 1995, on a model for websites called client-server, which should have been called calf-cow or slave-master, because we’re always the weaker party: dependent, subordinate, secondary. In defaulted regulatory terms, we clients are mere “data subjects,” and only server operators are privileged to be “data controllers,” “data processors,” or both.

Fortunately, the Net’s and the Web’s base protocols remain peer-to-peer, by design. We can still build on those. And it’s early.

A critical start in that direction is making each of us the first party rather than the second when we deal with the sites, services, companies and apps of the world—and doing that at scale across all of them.

Think about how much more simple and sane it is for websites to accept our terms and our privacy policies, rather than to force each of us, all the time, to accept their terms, all expressed in their own different ways. (Because they are advised by different lawyers, equipped by different third parties, and generally confused anyway.)

Getting sites to agree to our own personal terms and policies is not a stretch, because that’s exactly what we have in the way we deal with each other in the physical world.

For example, the clothes that we wear are privacy technologies. We also have norms that discourage others from doing rude things, such as sticking their hands inside our clothes without permission.

We don’t yet have those norms online, because we have no clothing there. The browser should have been clothing, but instead it became an easy way for adtech and its dependents in digital publishing to plant tracking beacons on our naked digital selves, so they could track us like marked animals across the digital frontier. That this is normative is no excuse. Tracking people without their conscious and explicit invitation—or a court order—is morally wrong, massively rude, and now (at least hopefully) illegal under the GDPR and other privacy laws.

We can easily create privacy tech, personal terms and personal privacy policies that are normative and scale for each of us across all the entities that deal with us. (This is what ProjectVRM’s nonprofit spin-off, Customer Commons, is about.)

It is the height of fatuity for websites and services to say their cookie notice settings are “your privacy choices” when you have no power to offer, or to make, your own privacy choices, with records of those choices that you keep.

The simple fact of the matter is that businesses can’t give us privacy if we’re always the second parties clicking “agree.” It doesn’t matter how well-meaning and GDPR-compliant those businesses are. Making people second parties in all cases is a design flaw in every standing “agreement” we “accept.” And we need to correct that.

The second reason agency matters more than data is that nearly the entire market for personal data today is adtech, and adtech is too dysfunctional, too corrupt, too drunk on the data it already has, and absolutely awful at doing what it has harvested that data for, which is so machines can guess at what we might want before they shoot “relevant” and “interest-based” ads at our tracked eyeballs.

Not only do tracking-based ads fail to convince us to do a damn thing 99.xx+% of the time, but we’re also not buying something most of the time.

As incentive alignments go, adtech’s failure to serve the actual interests of its targets verges on absolute. (It’s no coincidence that more than a year ago, up to 1.7 billion people were already blocking ads online.)

And hell, what they do also isn’t really advertising, even though it’s called that. It’s direct marketing, which gives us junk mail and is the model for spam. (For more on this, see Separating Advertising’s Wheat and Chaff.)

Privacy is personal. That means privacy is an effect of personal agency, projected by personal tech and by personal expressions of intent that others can respect without working at it. We have that in the offline world. We can have it in the online world too.

Privacy is not something given to us by companies or governments, no matter how well they do Privacy by Design or craft their privacy policies. Top-down privacy simply can’t work.

In the physical world we got privacy tech and norms before we got privacy law. In the networked world we got the law first. That’s why the GDPR has caused so much confusion. Good and helpful though it may be, it is the regulatory cart in front of the technology horse. In the absence of privacy tech, we also failed to get the norms that would normally and naturally guide lawmaking.

So let’s get the tech horse back in front of the lawmaking cart. If we don’t do that first, adtech will stay in control. And we know how that movie goes, because it’s a horror show and we’re living in it now.

Our radical hack on the whole marketplace

April 30, 2017 / Doc Searls / 4 Comments

In Disruption isn’t the whole VRM story, I visited the Tetrad of Media Effects, from Laws of Media: the New Science, by Marshall and Eric McLuhan. Every new medium (which can be anything from a stone arrowhead to a self-driving car), the McLuhans say, does four things, which they pose as questions that can have multiple answers, and they visualize this way:

tetrad-of-media-effects

The McLuhans also famously explained their work with this encompassing statement: We shape our tools and thereafter they shape us.

This can go for institutions, such as businesses, and whole marketplaces, as well as people. We saw that happen in a big way with contracts of adhesion: those one-sided non-agreements we click on every time we acquire a new login and password, so we can deal with yet another site or service online.

These were named in 1943 by the law professor Friedrich “Fritz” Kessler in his landmark paper, “Contracts of Adhesion: Some Thoughts about Freedom of Contract.” Here is pretty much his whole case, expressed in a tetrad:

contracts-of-adhesion

Contracts of adhesion were tools that industry shaped, was in turn shaped by, and that in turn shaped the whole marketplace.

But now we have the Internet, which by design gives everyone on it a place to stand, and, like Archimedes with his lever, move the world.

We are now developing that lever, in the form of terms any one of us can assert, as a first party, and the other side—the businesses we deal with—can agree to, automatically. Which they’ll do because it’s good for them.

I describe our first two terms, both of which have potentials toward enormous changes, in two similar posts put up elsewhere:

— What if businesses agreed to customers’ terms and conditions?

— The only way customers come first

And we’ll work some of those terms this week, fittingly, at the Computer History Museum in Silicon Valley, starting tomorrow at VRM Day and then Tuesday through Thursday at the Internet Identity Workshop. I host the former and co-host the latter, our 24th. One is free and the other is cheap for a conference.

Here is what will come of our work:
personal-terms

Trust me: nothing you can do is more leveraged than helping make this happen.

See you there.

VRM Day: Starting Phase Two

October 17, 2016 / Doc Searls / 0 Comments

VRM Day is today, 24 October, at the Computer History Museum. IIW follows, over the next three days at the same place. (The original version of this post was October 17.)

We’ve been doing VRM Days since (let’s see…) this one in 2013, and VRM events since this one in 2007. Coming on our tenth anniversary, this is our last in Phase One.

The difference between Phase One and Phase Two is that between rocks and snowballs. In Phase One we played Sisyphus, pushing a rock uphill. In Phase Two we roll snowballs downhill.

Phase One was about getting us to the point where VRM was accepted by many as a thing bound to happen. This has taken ten years, but we are there.

Phase Two is about making it happen, by betting our energies on ideas and work that starts rolling downhill and gaining size and momentum.

Some of that work is already rolling. Some is poised to start. Both kinds will be on the table at VRM Day. Here are ones currently on the agenda:

VRM + CRM via JLINC. See At last: a protocol to link VRM and CRM, and The new frontier for CRM is CDL: customer driven leads. This is one form of intentcasting that should be enormously appealing to CRM companies and their B2B corporate customers. Speaking of which, we also have—
Big companies welcoming VRM. Leading this is Fing, a French think tank that brings together many of the country’s largest companies, both to welcome VRM and to research (e.g. through Mesinfos) how the future might play out. Sarah Medjek of Fing will present that work, and lead discussion of where it will head next. We will also get a chance to participate in that research by providing her with our own use cases for VRM. (We’ll take out a few minutes to each fill out an online form.)
Terms individuals assert in dealings with companies. These are required for countless purposes. Mary Hodder will lead discussion of terms currently being developed at Customer Commons and the CISWG / Kantara User Submitted Terms working group (Consent and Information Sharing Working Group). Among other things, this leads to—
Next steps in tracking protection and ad blocking. At the last VRM Day and IIW, we discussed CHEDDAR on the server side and #NoStalking on the individual’s side. There are now huge opportunities with both, especially if we can normalize #NoStalking terms for all tracking protection and ad blocking tools. To prep for this, see Why #NoStalking is a good deal for publishers, where you’ll find the image on the right, copied from the whiteboard on VRM Day.
Blockchain, Identity and VRM. Read what Phil Windley has been writing lately about distributed ledgers (e.g. blockchain) and what they bring to the identity discussions that have been happening for 22 IIWs, so far. There are many relevancies to VRM.
Personal data. This was the main topic at two recent big events in Europe: MyData2016 in Helsinki and PIE (personal information economy) 2016 in London. The long-standing anchor for discussions and work on the topic at VRM Day and IIW is PDEC (Personal Data Ecosystem Consortium). Dean Landsman of PDEC will keep that conversational ball rolling. Adrian Gropper will also brief us on recent developments around personal health data.
Hacks on the financial system. Kevin Cox can’t make it, but wants me to share what he would have presented. Three links: 1) a one minute video that shows why the financial system is so expensive, 2) part of a blog post respecting his local Water Authority and newly elected government, and 3) an explanation of the idea of how we can build low-cost systems of interacting agents. He adds, “Note the progression from location, to address, to identity, to money, to housing. They are all ‘the same’.” We will also look at how small business and individuals have more in common than either does with big business. With a hint toward that, see what Xero (the very hot small business accounting software company) says here.
What ProjectVRM becomes. We’ve been a Berkman-Klein Center project from the start. We’ve already spun off Customer Commons. Inevitably, ProjectVRM will itself be spun off, or evolve in some TBD way. We need to co-think and co-plan how that will go. It will certainly live on in the DNA of VRM and VRooMy work of many kinds. How and where it lives on organizationally is an open question we’ll need to answer.

Here is a straw man context for all of those and more.

Top Level: Tools for people. These are ones which, in legal terms, give individuals power as first parties. In mathematical terms, they make us independent variables, rather than dependent ones. Our focus from the start has been independence and engagement.
- VRM in the literal sense: whatever engages companies’ CRM or equivalent systems.
- Intentcasting.
- PIMS—Personal Information Management Systems. Goes by many names: personal clouds, personal data stores, life management platforms and so on. Ctrl-Shift has done a good job of branding PIMS, however. We should all just go with that.
- Privacy tools. Such as those provided by tracking protection (and tracking-protective ad blocking).
- Legal tools. Such as the terms Customer Commons and the CISWG are working on.
- UI elements. Such as the r-button.
- Transaction & payment systems. Such as EmanciPay.

Those overlap to some degree. For example, a PIMS app and data store can do all that stuff. But we do need to pull the concerns and categories apart as much as we can, just so we can talk about them.

Kaliya will facilitate VRM Day. She and I are still working on the agenda. Let us know what you’d like to add to the list above, and we’ll do what we can. (At IIW, you’ll do it, because it’s an unconference. That’s where all the topics are provided by participants.)

Again, register here. And see you there.

