Rebel Tech Newsletter No. 64: Avoid Unpaid AI Training | New School vs. Old School


June 11, 2024

The Rebel Tech Newsletter is our safe place to critique data and tech algorithms, processes, and systems. We highlight a recent data article in the news and share resources to help you dig deeper to understand how our digital world operates. DataedX Group helps data educators, scholars and practitioners learn how to make responsible data connections. We help you source remedies and interventions based on the needs of your team or organization.


IN DATA NEWS

No Robots(.txt): How to Ask ChatGPT and Google Bard to Not Use Your Website for Training.

You can prevent ChatGPT and Bard from using the content on your website to train their models. Using your website's "robots.txt" file, you can instruct bots and web crawlers NOT to scrape content.

While it may appear that AI dominates every industry and sector, public sentiment around AI is increasingly skeptical. We’ve been witnesses to its lies (e.g., hallucinations), generic knowledge (e.g., regurgitated output) and biased responses (e.g., whitewashed content).

Every new AI release unveils yet another countermeasure. And this arsenal keeps expanding with approaches that block the AI’s access to data. Today, an oldie-but-goodie approach is highlighted along with a recap of recent and human-based interventions. Stacking these old, new and forever school approaches helps to curb runaway AI.

Old School

The robots.txt file has been a staple since the early days of the web. It has been a constant line of defense to limit access to a website’s webpages and/or directories of webpages. The simple act of providing instructions for web crawlers (and now generative AI algorithms) should be a universal stopgap. It’s not. First, adding disallow instructions to this file contradicts a website’s purpose, which is to be discovered. So the incentive to update the robots.txt file remains low. And second, following this file’s instructions isn’t required or enforceable.

The resilience of the robots.txt file comes from web crawling politeness policies. The most relevant one is that any web scraping tool should respect the allow and disallow instructions. So what does that mean for you? If you host a website, revise your robots.txt file to disallow bots from continuing to access, read and ingest your webpage content. Companies adhere to politeness policies since being labeled a bad actor damages their reputation and profits. Follow the step-by-step guide given in the article below.
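As a rough illustration of what those disallow instructions can look like, here is a minimal Python sketch. It assumes the crawler tokens "GPTBot" (OpenAI) and "Google-Extended" (Google) that those companies have published for opting out of AI training; confirm the current tokens in each company's documentation before relying on them. The example.com URL and the is_allowed helper are hypothetical, and the check uses Python's standard urllib.robotparser module to show how a polite crawler would read the rules. It does not enforce anything against a crawler that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for illustration, not a
# definitive rule set). It asks two AI-training crawlers to stay out of
# the whole site while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-abiding crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post.html"  # hypothetical page
    for agent in ("GPTBot", "Google-Extended", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

The text inside ROBOTS_TXT is what would go in the robots.txt file at the root of your site. Keeping the final catch-all group means ordinary search crawlers can still discover your pages, which addresses the discoverability concern raised above.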

New School (review)

There’s a swell of checks-and-balances methods performed as part of AI systems. Recently, watermarking in AI has gained renewed attention. It amounts to adding a hard-to-remove digital tracker to an algorithm: a small piece of code that keeps a running log of how the algorithm was manipulated. Digital watermarking can help distinguish between what’s AI generated, AI assisted or AI enabled. It can help indicate what’s perceived as digitally true. Automated content moderation algorithms attempt to identify blatantly inappropriate content. They can also help combat AI-enabled fraud, misinformation and disinformation, when executed responsibly. Otherwise, content moderation dissolves into algorithmic misogynoir.

Forever School (review)

The human eye — with our critical thinking skills — remains one of the best stopgap measures to check AI’s output. Here’s a suggested shortcut to help you more quickly vet responses.

  1. Attempt to validate 3-4 responses by looking for credible references. Credible references vary by industry and sector, so one may be a thought leader, a scholarly article, a well-respected industry report or some other form of externally vetted documentation. Allocate only 30-45 minutes.
  2. If these credible responses are consistent, then you can be confident that you’ve vetted the response as well as you could and it seems to be trustworthy.
  3. When the responses are inconsistent, it’s time to stop your solo validation process. Before you become too frustrated and ready to walk away, it’s highly recommended that you reach out to a trusted colleague to ask for their perspective and show them your research. Most folks will reach out to ask but not share what content they found, where they found it or how they arrived at their conclusions. You’re showing and discussing your research so that you both can learn to weed out bad intel more easily and suss out good information better. Allocate only 30-45 minutes.
  4. Remember: you’ll never be 100% right all the time. You do the best investigative work you can do within reasonable time limits. If you’re wrong, then own it, be curious as to why and learn from the mistake. With practice you’ll get better at vetting responses, identifying mis/disinformation and factual information. It takes time to understand how to rely on your own knowledge base.

Like what you're reading? Find it informative and insightful? You can sponsor the Rebel Tech Newsletter and follow us on LinkedIn.


"People believe that they won’t be able to learn to code, that it’ll take a long time to learn the skill well or they have to be a math prodigy to understand and apply coding concepts. In reality, you don’t have to be “super smart,” but you must be persistent." pg 87


HAPPENINGS & APPEARANCES

  • [BWD BY DATAEDX GROUP] Come hear Monique Mills, TPM Focus CEO, share how to build sellable assets or receive tips on becoming an effective intrapreneur from Dr. Kenya Oduor, Lean Geeks CEO. A few days are left to reserve your Black Women in Data Summit 2024 early bird general admission ($599) or VIP/luxe admission ticket ($2499) at https://www.blackwomenindata.com/. On June 16 at 12:00AM EDT, BWD registration investment goes up to $849 and $3000, respectively. Support BWD in other ways by completing the become a sponsor/community partner form for organizations. As an individual, you can also support a BWD who’s financially unable to attend with the gift of a registration: give now. We’re able to cover 2 BWDs’ in-need requests. Assist us in covering the other 4!
  • [SETTING YOUR PRICES] Moving from employee pay rate to entrepreneurial pay rate is possible. There’s a value-based scale you overlook as one who’s analytically-minded. That time-for-money swap you’re accustomed to is easier to math, but it minimizes our spectrum of contributions. I’m ready to share more. It’ll require a half-day or full-day immersive virtual workshop. Reply to this email with ‘SETITOFF’ and deets will be shared if there’s enough interest.
  • [PRE-ORDER NOW!] Dr. Carlotta A. Berry and I have been co-editing a McGraw-Hill textbook, Mitigating Bias in Machine Learning, that highlights the scholarly work of 25 women and/or historically-excluded researchers. It’s a practical guide that shows, step by step, how to use machine learning to carry out actionable decisions that do not discriminate based on numerous human factors, including ethnicity and gender. The authors examine the many kinds of bias that occur in the field today and provide mitigation strategies that are ready to deploy across a wide range of technologies, applications, and industries. The book will be available September 13, 2024. Order your copy at Barnes & Noble.


LAUGHING IS GOOD FOR THE SOUL

Stay Rebel Techie,

Dr. Brandeis

Thanks for subscribing! If you like what you read or use it as a resource, please share the newsletter signup with three friends!

Brandeis Marshall - DataedX

Learn how to make more responsible data connections. I help educators, researchers and practitioners align data policies, practices and products for equity. Sign up for my Rebel Tech Newsletter!
