How to Opt Out of AI Training on Major Platforms
"You have the right to remain silent. Anything you say can and will be used ... for training AI!"
These days, every popular online service collects data to train its own AI by default. If you feel uncomfortable providing your data for AI training, there is a way to opt out! See the list of popular services and instructions on how to opt out:
LinkedIn
You can disable the use of your data for AI training on this page. Note: if you don't see this option and you are located in the European Union, your data is not being used due to your location and the European Artificial Intelligence Act (AI Act) adopted in 2024.

Meta (Facebook and Instagram)
There is no off switch, but you may send a request through this form.

X (aka Twitter)
You can disable the use of your data for training the Grok AI model on this page.

OpenAI ChatGPT
IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.
- Log in to your account at https://chatgpt.com and click your profile icon (top-right).
- Choose "Settings".
- Select "Data Controls", then set "Improve the model for everyone" to "Off".

Anthropic Claude
IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.
There is no separate switch to turn off the use of your data for AI training. However, if you use their paid product, Anthropic currently (as of September 2024) states that "Anthropic may not train models on Customer Content from paid Services" (see their commercial terms).
Also, Anthropic reserves the right to store your inputs and outputs for 2 to 7 years if they were automatically flagged by its trust and safety classifiers as violating Anthropic's Usage Policy. More information here.
If you delete a conversation, Anthropic automatically deletes the prompts and outputs on its backend within 30 days, unless you submit a separate data deletion request. More information here.

Perplexity AI
IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.
Go to Perplexity AI settings and turn off the "AI Data Retention" switch.

Reddit
IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.
Reddit does not have a separate switch to exclude your data from AI training. According to Reddit's Terms of Use, by using Reddit you provide them with "...worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit."
Reddit is licensing user data to third parties that may use your data to train AI (source).
However, you may remove your posts and comments through this page.

WordPress.com
If you host your website on WordPress.com, then your content is used by WordPress.com for AI training by default (and is also shared with third-party services).
- To turn it off, go to the admin panel (dashboard) of your WordPress.com-hosted website (for example, https://YOURWEBSITE.wordpress.com/wp-admin/). Then navigate to Settings → General (or Hosting → Site Settings if you use WP-Admin).
- Scroll down to the Personalization section and enable the "Prevent third-party sharing for YOURWEBSITE.wordpress.com" option.
- If you run multiple websites on WordPress.com, you need to enable this option for each of them, one by one!

For more details about excluding your content from AI training, please visit this page: https://wordpress.com/support/privacy-settings/make-your-website-public/#prevent-third-party-sharing.
GitHub Copilot
IMPORTANT: GitHub Copilot uses public repositories with permissive licenses for training its AI models. Here are three ways to prevent GitHub Copilot from using your code for training:
- Do not use GitHub: The most effective way to ensure your code is not used for AI training is to avoid using GitHub altogether.
- Keep repositories private: By keeping your repositories private, you can prevent GitHub Copilot from accessing and using your code for training purposes.
- Use restrictive licenses for public repositories: For public repositories, you can use a restrictive license or even no license to avoid giving explicit permission for your code to be used for AI training. This can help limit the use of your code by GitHub Copilot.
If your organization uses GitHub Copilot, you can exclude certain content from being used by it. For more details, visit this page.
GitHub - Your Repositories In Datasets Used By Popular LLMs
Many LLMs (Large Language Models) use The Stack v2 dataset for training. This dataset is a collection of public source code in over 600 programming languages, about 67 terabytes in size. It includes source code from GitHub repositories with permissive licenses.
You can prevent your code (from your public GitHub repositories) from being included in The Stack v2 dataset and thereby exclude it from being used for AI training.
- First, to check whether your repositories are included in The Stack dataset, visit https://huggingface.co/spaces/bigcode/in-the-stack and enter your GitHub username.
- If your repositories are included, create a new issue at https://github.com/bigcode-project/opt-out-v2/issues/new?&template=opt-out-request.md to request the exclusion of your repositories from future versions of Stack v2 dataset.
If you want to learn more about Stack v2 dataset, please visit https://huggingface.co/datasets/bigcode/the-stack-v2.
Substack
IMPORTANT: By default, Substack does not block AI bots from using your content for training purposes.
To prevent AI bots from using your Substack content for training, you need to enable the "Block AI training" option. This option is available on the "Settings" page of your Substack account.
Follow these steps to enable the "Block AI training" option:
- Log in to your Substack account and navigate to the "Settings" page.
- Scroll down to the "Block AI training" section.
- Enable the "Block AI training" option by toggling the switch.
Please note that this option only updates the public rules for AI bots. Some AI bots have been observed not respecting such public rules (like robots.txt and similar access rules). Therefore, if your Substack page is public, some AI bots may still access your content despite this option. Because of this, the option does not guarantee that your content will not be used for AI training.
Add rules for AI bots to your website
If you are running a commercial website to promote your business, sell products, or provide a service, you may want to allow your website to be indexed by AI search engines so that regular users can easily discover it via AI when needed.
However, if you are running a website where you don't want your content to be captured by web bots (or you want this content to be available exclusively through your website), then you first need to create a robots.txt file that defines rules for web bots. If your website has no robots.txt, web bots (including those used by AI) will assume they are allowed to read and learn from your website without any restrictions.
Here are some examples of how to use the robots.txt file to control the access of web bots, including AI bots, to your website:
Example 1: Disallow All Bots
User-agent: *
Disallow: /
This rule tells all bots that they are not allowed to access any part of your website.
Example 2: Allow All Bots
User-agent: *
Disallow:
This rule tells all bots that they are allowed to access all parts of your website.
Example 3: Disallow Specific Bots
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
These rules tell specific bots (in this case, "CCBot", used by Common Crawl, whose data is used by ChatGPT and many others, and "anthropic-ai", used by Anthropic) that they are not allowed to access any part of your website.
Example 4: Disallow All Bots from Specific Directories
User-agent: *
Disallow: /private/
Disallow: /tmp/
This rule tells all bots that they are not allowed to access the "/private/" and "/tmp/" directories of your website.
IMPORTANT: this rule is voluntary, and some web and AI bots may ignore it and read these directories despite it.
Example 5: Allow Specific Bots
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
This rule tells Googlebot (which is used by Google Search indexing) that it is allowed to access all parts of your website, while all other bots (matched by *) are not allowed to access any part of your website.
By using these examples, you can customize the robots.txt file to control which bots can access your website and which parts of your website they can access.
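Crawler names change over time, so check each AI vendor's documentation for its current user agents, but as of late 2024 a combined robots.txt targeting several publicly documented AI crawlers (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl, anthropic-ai and ClaudeBot for Anthropic, PerplexityBot for Perplexity) might look like this:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
This blocks only the listed AI crawlers; regular search engine bots such as Googlebot are unaffected, so your site stays indexable.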
That's why we've created this free tool that allows you to quickly create a robots.txt file that disallows AI bots from reading and learning from your website.
IMPORTANT: while the robots.txt file is the standard way to define rules for web bots, it is not a perfect solution. The file is essentially a set of voluntary rules, and some AI bots have been caught not respecting them, accessing and capturing content from websites despite the rules defined in robots.txt. For stronger protection (if you need it), please consider adding login/password protection to your website.
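Before publishing your robots.txt, you can verify that its rules behave as you expect by testing them locally with Python's standard urllib.robotparser module. This is a minimal sketch using the rules from Example 3 above; the URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Rules from Example 3: block CCBot and anthropic-ai, leave everyone else alone
rules = """
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The blocked AI bots cannot fetch any page...
print(parser.can_fetch("CCBot", "https://example.com/articles/post1"))      # False
print(parser.can_fetch("anthropic-ai", "https://example.com/"))             # False

# ...while bots without a matching rule are allowed by default
print(parser.can_fetch("Googlebot", "https://example.com/articles/post1"))  # True
```

Note that can_fetch() returns True for any bot that has no matching User-agent group, which mirrors how crawlers treat a missing or empty robots.txt.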
Frequently Asked Questions
1. What does it mean to "opt out" of AI training?
Opting out of AI training generally means preventing companies from using your data—such as posts, comments, or source code—to train their AI models. If you find it uncomfortable or invasive that your personal or creative content is used without explicit permission, you can follow certain steps to limit or block data collection.
2. How do I disable AI training using my LinkedIn data?
LinkedIn provides a dedicated page to disable data usage for AI. Simply visit this LinkedIn settings page and toggle off the option to use your data for AI. If this option isn’t visible (particularly in the EU), your data is not being used due to local regulations like the European Artificial Intelligence Act.
3. Is there a way to opt out of AI data usage on Meta (Facebook and Instagram)?
There’s no straightforward “off” switch for AI training on Meta’s platforms. However, you can submit a request through this Meta form. Keep in mind that this process isn’t as clear-cut as toggling a setting; it involves Meta reviewing your request for data exclusion.
4. Can I prevent X (Twitter) from using my posts for AI training?
Yes. X provides a direct way to opt out of its AI model, Grok. Visit this page after logging in, and disable the option to use your content for AI training. This ensures that your tweets are no longer fed into Grok AI’s datasets.
5. Does OpenAI ChatGPT use my data for training by default?
If you use ChatGPT without logging in, your inputs and outputs may be collected for AI model improvements. Once logged in, go to “Settings” > “Data Controls” > “Improve the model for everyone” and switch it off. This setting helps prevent OpenAI from using your conversation data to train ChatGPT.
6. Does Anthropic Claude automatically train on my conversations?
Yes, by default, Anthropic may collect user prompts and responses for training, especially in the free version. Paid users, according to Anthropic’s commercial terms, may have more safeguards. Also, flagged content can be stored from 2 to 7 years for safety checks. Deleting your conversation will remove prompts and outputs from Anthropic’s systems within 30 days, unless you submit a separate data deletion request.
7. How can I prevent Perplexity AI from retaining and training on my data?
When using Perplexity AI without an account, your data may be collected for model improvements. After logging in, visit your account settings and toggle off the “AI Data Retention” option to stop Perplexity from storing your prompts for training.
8. Why is there no single toggle to opt out on Reddit?
Reddit’s Terms of Use grant a broad license allowing the platform and partners to use user content for various purposes, including AI training. You cannot directly opt out, but you can remove your posts and comments. Check this page for deleting or accessing your Reddit data.
9. How do I stop WordPress.com from sharing my website content for AI training?
By default, WordPress.com uses your site content for AI features and may share it with third parties. Go to your WordPress.com admin dashboard (e.g., https://YOURSITE.wordpress.com/wp-admin/) > Settings > General or Hosting > Site Settings and enable "Prevent third-party sharing". Repeat for each of your WordPress.com sites.
10. Can I prevent GitHub Copilot from training on my code?
Copilot primarily trains on public repositories with permissive licenses. Ways to limit your code’s exposure include using private repos, refraining from GitHub altogether, or adopting a restrictive license. Organizations can also exclude content through GitHub’s content exclusion settings.
11. Are my public GitHub repositories in The Stack v2 dataset for AI?
Many popular code-oriented LLMs train on The Stack v2, a huge dataset of public source code. You can check your repositories at this link. If included, submit an opt-out request at this GitHub issue form to remove your repos from future data releases.
12. Will using robots.txt or “Block AI training” on Substack fully stop AI bots?
Robots.txt and Substack’s “Block AI training” feature are good first steps but not guarantees. Some bots ignore robots.txt and continue crawling. If you need stronger protection, consider private or restricted-access methods (passwords, paywalls, etc.). Substack’s block setting updates a public rule but cannot force third-party bots to comply.