It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.
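That kind of baseline measurement can be sketched in a few lines of Python (a toy illustration only; the wordfreq project's actual pipeline is far more careful about tokenization and corpus balance):

```python
from collections import Counter
import re

def word_frequencies(text):
    """Relative word frequencies in a blob of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# A specialist text: "carburetor" sits far above its everyday baseline
freqs = word_frequencies(
    "the carburetor needs cleaning because the carburetor is old"
)
print(freqs["carburetor"])  # 2 of 9 words
```

Averaged over a healthy mix of books, news, and blogs, the domain-specific spikes wash out and you get something close to the true baseline.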
In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.
SEO text carefully tuned to tf-idf metrics and keyword-stuffed right up to the empirically determined threshold Google still allows should have unnatural word frequencies.
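As a rough illustration of why stuffing shows up in the numbers, here's a toy tf-idf score (one common formulation, with add-one smoothing on the document frequency; the example documents and the exact formula are assumptions for illustration, not anything from a real SEO tool):

```python
import math

def tf_idf(term, doc, corpus):
    """Toy tf-idf: term frequency in one doc times a smoothed
    inverse document frequency over the corpus."""
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

docs = [
    "best trail shoes review shoes cheap shoes deal shoes",  # keyword-stuffed
    "i went for a run in the park yesterday",
    "my new shoes arrived in the mail today",
    "the weather was lovely for a walk",
]
stuffed = tf_idf("shoes", docs[0], docs)
normal = tf_idf("shoes", docs[2], docs)
```

A page tuned to maximize this score ends up repeating its keyword at rates no human writer would produce, which is exactly the unnatural frequency signature described above.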
LLM content should just enhance and cement the status quo word frequencies.
Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
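That loop can be sketched as a toy simulation (all numbers made up for illustration): a model that over-uses a word by some factor, whose output is blended back into the next training corpus, pushes the word's frequency above its human baseline generation after generation.

```python
def feedback_generations(p_human, boost, mix, generations):
    """Track one word's frequency across training generations.

    p_human: the word's true frequency in human-written text
    boost:   how much the model over-uses it (1.5 = 50% too often)
    mix:     fraction of each new training corpus that is model output
    """
    p = p_human
    history = [p]
    for _ in range(generations):
        p_model = min(1.0, p * boost)             # model over-uses the word
        p = (1 - mix) * p_human + mix * p_model   # blend human + model text
        history.append(p)
    return history

# "delve"-style drift: a rare word, modest over-use, half-synthetic corpora
history = feedback_generations(p_human=0.0001, boost=1.5, mix=0.5,
                               generations=10)
```

With a fixed human/model mix the frequency settles at an elevated fixed point; if the model's share of the corpus keeps growing, the drift keeps compounding.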
Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).
Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.
Far from saying the pool of language is now polluted, I think we now have a great data set to begin to discern authentic from inauthentic human language. Although sure, people on the fringes could get caught in a false positive for being bots, like you or me.
The biggest LLM of them all is the daily driver of all new linguistic innovation: Human society, in all its daily interactions. The quintillions of daily phrases exchanged and forever mutating around the globe - each mutation of phrase interacting with its interlocutor, and each drawing not from the last 500,000 tokens but from the entire multi-modal, if you will, experience of each human to date in their entire lives - vastly eclipse anything any hardware could ever emulate given the current energy constraints. Software LLMs are just a state machine stuck in a moment in time. At best they will always lag, the way Stalinist language lagged years behind the patois of average Russians, who invented daily linguistic dodges to subvert and mock the regime. The same process takes place anywhere there is a dominant official or uncool accent or phrasing. The ghetto invents new words, new rhythm, and then it becomes cool in the middle class. The authorities never catch up, precisely because the use of subversive language is humanity's immune system against authority.
If there is one distinctly human trait, it's sniffing out anyone who sounds suspiciously inauthentic. (Sadly, it's also the trait that leads to every kind of conspiracy theorizing imaginable; but this too probably confers in some cases an evolutionary advantage). Sniffing out the sound of a few LLMs is already happening, and will accelerate geometrically, much faster than new models can be trained.
Some day we may view this as the beginnings of machine culture.
Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.
2. Given how LLMs work, a prompt is a bias — they're one and the same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors to write the "character card" describing a character in the style you want the character to speak in, for exactly this reason.) Whatever prompt the developer uses is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.
3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)
TFA mentions this hasn't been the case.
It’s the same origin. On Slashdot (the HN of the early 00’s) people would admonish others to RTFA. Then they started using it as a referent: TFA was the thing you were supposed to have read.
At least in Googles case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.
Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue, and those sites were loaded with ads. As long as advertisers didn't stop buying, they didn't feel much pressure to make big changes.
A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.
The problem is that, of the signals you mention,
• the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")
• and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.
Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.
And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.
Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)
I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.
I've noticed that lately. It used to be the top google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.
A useful Search would ideally send a user to the site with the most signal and the least noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of a page.
Up until ~5 years ago, Google was able to strike a balance between the two; you'd get results with some ads, but the signal generally outweighed the noise. Unfortunately, from what I can tell from anecdotes and courtroom documents, somewhere in 2018-2019 the Ads team at Google essentially hijacked every other part of the company, threatening that yearly bonuses wouldn't be paid unless everyone kowtowed to the Ads team's wish to optimize ad revenue, and there's been no sign of it stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though, and Kagi is only used by tech enthusiasts. The problem with copying Google is that you need a ton of computing resources upfront, and you're going up against a company with infinitely more money and the ability to ensure users don't leave its ecosystem; go ahead and abandon Search, but good luck convincing others to give up, say, their Gmail account, which keeps them locked to Google, and Search will be there, enticing the average user.)
Google has absolutely zero incentive to filter generative AI junk out of its search results beyond the amount that's damaging its PR, since most of the SEO spam is also running Google Ads (unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to reduce it just enough that a user won't get the same type of AI junk twice.
A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.
Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.
Increasingly I find that for in-depth explanations or tutorials Youtube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.
It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.
Of course the vast majority don't have much real success and get on with life and the crank turns and a new generation perpetuates the cycle.
LLMs etc. may make things marginally easier but there's no shortage of twenty somethings with lots of time imagining riches while making pennies.
It sucks, because sharing recipes seemed like one of those things the internet could be really good at.
I read Spanish and Italian fluently and stumble my way through Japanese (with translation). It's easier to find a good recipe in these languages, provided you can find the ingredients or substitutes.
The only remaining reliable source - now that many newspapers are axing the remaining staff in favour of LLMs - is pre-2020 print cookbooks. Anything online or printed later must be assumed to be tainted, full of untested sewage and potentially dangerous suggestions.
I joke with folks that my assumption with "one clove of garlic" is that they really mean "one head of garlic" if you want any flavour. (And if the recipe title has "garlic" in it and you are using one clove, you’re lying.)
When I saw them, they blew my mind. Short to store and easy to understand.
That is, when baking, you can usually (again, exceptions for creaming the sugar in butter, etc.) take all of your dry ingredients and mix/sift them together, and then you pour your wet ingredients in a well you’ve made in the dry ingredients (these can also usually be mixed together).
But for shortbread or fork biscuits those three could indeed all go in the bowl in one go (but that one admittedly doesn't really need a bracket because the recipe is "put in bowl, mix with hands, bake").
The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.
For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).
There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.
However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.
I think there's an analogy here to the war on content slop. Since what the content-slop producers want is just for you to see it so they can serve you ads, it doesn't need anything else that our algorithms could trip on, like links to malware or calls to action to be defrauded. It looks just like the real stuff, and telling that it isn't could require rather vast amounts of input from a human just to be mostly sure. And we don't have the ability to authenticate where it came from. (There is no content authentication solution that will work at scale. No matter how you try to get humans to "sign their work," people will always work out how to automate it, and then it's done.) So the one good and solid signal that helps in email is gone for general web content.
I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.
I'm not saying this is impossible but that's going to be an uphill sell for me as a concept. According to some quick stats I checked I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive amount overall for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from and as a result I don't use email much for anything apart from when I have to.
Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.
It's also worth noting that if I do have an email that's flagged as spam when it shouldn't be, I then have to wade through a much deeper pond of shit to find it. So again, better, but IMO not even remotely solved.
Looking at the last days’ spam¹ I have three 419-style scams (widows wanting to give away their dead husbands’ grand piano or multi-million euro estate) and three phishing attempts. There are duplicate messages in each category.
About fifteen years ago, I did a purge of mailing list subscriptions and there’s very little that comes in that I don’t want, most notably a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list and given that it’s only four or five messages a year, I guess I can live with that.
⸻
1. I cleaned out spam yesterday while checking for a confirmation message from a purchase.
Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.
I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.
How do we know what content LLMs were fed? Isn't that a highly guarded secret?
Won't the quality of the training content be paramount for the quality of the generated output, or does it not work that way?
2017: Invention of transformer architecture
June 2018: GPT-1
February 2019: GPT-2
June 2020: GPT-3
March 2022: GPT-3.5
November 2022: ChatGPT
You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.

The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.
https://en.wikipedia.org/wiki/Low-background_steel
>Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]
From the same wiki you linked:
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature"
and
"For the most demanding items even low-background steel can be too radioactive and other materials like high-purity copper may be used"
Another solution, and one I'd pursue if I weren't so lazy, is ocean-based carbon binding. You can run electricity directly through ocean water and precipitate the carbon out as calcium carbonate, which is both useful to humans as-is and after processing, and useful to the coral reefs and crustaceans/mollusks or whatever in the oceans.
If anyone wants to kick me about a million US dollars, I can make a POC on a used barge with solar panels and as much recycled material as possible, and have it just run off the coast of Florida or something. I figure the total cost to get a barge is around a quarter million, all-in[1]; the electronics and seawater gear are about another $150-200 thousand, and the rest is mine for the idea and the lawyers' to get this approved and left alone to do the research.
[0] burning it for heat is fine, as the net CO2 levels will remain constant, but i mean things like houses and boardwalks and boats, furniture, and so on.
[1] Could be more now; the last time I was researching seaworthy barge costs it was between $100,000 and $200,000. I'm hoping someone can donate the barge so I can make the rest more fit for purpose - redundancy, better solar, better MPPT, better batteries, better materials for the electrodes (it takes platinum and titanium IIRC; I haven't looked at my documents in a long while).
For applications that need to avoid the background radiation (like physics research), pre atomic age steel is extracted, like from old shipwrecks.
> Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.
This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.
LLMs pollute the internet like atomic bombs polluted the environment.
Making resources like wordfreq more visible won't exacerbate any of these concerns.
The new stuff generated does (and this is honestly already captured).
This author doesn't generate content. They analyze data from humans. That "from humans" part can no longer be reliably discerned, and thus the project can't continue.
Their research and projects are great.
I see a lot of people who are upset about AI still using AI image generation, because it's not in their field, so they feel less strongly about it - and they can't create art themselves anyway. That's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.
Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.
Edit: just the first one
And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.
Also: as to the claims that AI will cause stagnation and a reduction of the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but for more in-depth explanations of what words/sections mean.
Here's to a comment that will never be read because of all the noise in this thread :/
The complaint about pollution of the Web with artificial content is timely, and it's not even the first time due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").
Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.
When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.
Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, reddit, Facebook, geocities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own successes and taken over by spam.
It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.
You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.
IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.
But if they ever include other topics, they risk becoming more mainstream and noisy. Even within adjacent fields (like the various Stacks) it gets pretty bad.
Maybe the trick is to stay within a single small sphere then and not become a general purpose discussion site? And to have a low enough volume of submissions where good moderation is still possible? (Thank you dang and HN staff)
I wonder if it is more to do with the community itself. HN users tend to have very intelligent discussions on pretty much anything, and discourages shitty, unnuanced, one-line takes. This, coupled with a healthy moderation system, makes it hard for the lower quality discussion to break in and override the good stuff.
A good example of the generalization problem you discuss is reddit.
You have to unsubscribe from all the defaults and find the small, niche, communities about specific topics. If not, it's the same stuff, reposted, over and over, across different subs and/or social sites.
My experience here is that it's pretty good for things outside of tech (at least better than the average internet) but definitely not great.
A vaping discussion brought up that the glycerin used was safe and the same thing used in smoke machines, and someone else brought up a study showing that smoke machines are an occasional safety issue. Nowhere near every discussion goes that well, but stick around and you'll see in-depth discussion.
Go to a public health website by comparison and you'll see warnings without context and possibly a positive spin compared to smoking. https://www.cdc.gov/tobacco/e-cigarettes/index.html I suspect most people get basically nothing from looking at it.
Sometimes I try and engage, but honestly, mostly I think it's not worth it. Otherwise you end up doing this with your life: https://xkcd.com/386/
After doing some healthcare work I ended up understanding that some topics are not well known even by the professionals dedicating their whole lives to that because there are big gaps in the human knowledge on the topics.
I agree that people that think they can reason in two minutes about anything are a problem, but it's not a healthcare only issue (same happens for politics, economics, environment, etc.)
Engineers have the luck of working in a field where many things have a clear, known explanation (although, try to make an estimate of how long a team will take to implement a feature, and everybody will come up with something else).
And yes, this is true for many other areas of discussion at HN. It's just that it is most obvious to me in the area that my wife specializes in, because I pick up enough via osmosis from her to know when other people don't even have my limited level of understanding.
1: Or at least were 15 years ago when my wife told me about it- the argument might have been largely concluded and she just never updated me since I don't keep up with the medical literature the way she does.
2: Two decades ago there was a huge push for the "human genome project" under the basis that this would be "reading the blueprints for human life" and that would give us all of these medical breakthroughs. Basically none of those breakthroughs happened because we've spent the past 20 years learning all of the different ways that it is NOT a blueprint and that cells do things very differently from human engineers.
As someone with domain expertise here, I wholeheartedly disagree. HN is very bad at percolating accurate information about topics outside its wheelhouse, like clinical medicine, public health, or the natural sciences. It is also, simultaneously, extremely prone to overestimating its own collective competency at understanding technical knowledge outside its domain. In tandem, those two make for a rather dangerous combination.
Anytime I see a post about a topic within my area of specialty, I know to expect articulate, lengthy, and completely misguided or inaccurate comments dominating the discussion. It's enough of a problem that trying to wade in and correct them is a losing battle; I rarely even bother these days.
It's kind of funny that XKCD #793[0] is written about physicists, because the effect is way worse with software engineers.
And to hold up discussions about MS as an example of 'extremely' low quality discussion is, ah, interesting. Do you have any recent examples of such discussions?
relative to what? reddit?
also there's a trade off between entropy and "quality". too much "quality" and everyone gets bored and goes somewhere more entertaining
https://news.ycombinator.com/item?id=41499957
https://news.ycombinator.com/item?id=41408124
https://news.ycombinator.com/item?id=41335757
https://news.ycombinator.com/item?id=41327379
None of which fit your description of “neckbeardy tropes about their products being garbage spyware, switch to Linux, they're stealing your data, the OS is trash”.
And it isn’t just me, because if you look at those comments, I was talking to other people who weren’t invoking those “neckbeardy tropes” either
1. build a userbase, free product
2. once userbase get big enough, any new account requires a monthly fee, maybe $1
3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.
no ads, simple.
The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.
(If children are never taught to think critically, then...)
You imply that thousands of years ago everybody was thinking critically?
Thinking critically is hard, stressful and might take some joy from your life.
Even if so, this is a dangerous thought that discourages the decisive action likely to be necessary for this to happen.
Sci-fi author:
I created the Torment Nexus to serve as a cautionary tale...
Tech Company:
Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"
1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm
As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to do a direct search, please tell me.
Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...
Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/
This seems to bypass all of the LLM stuff for now.
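If that URL pattern holds (it's undocumented, and Amazon could change it at any time), the copy-paste step is easy to script. A minimal sketch, assuming the product identifier (ASIN) is the last path segment as in the example above:

```python
def review_url(product_url):
    """Build the review-page URL from a product-page URL, assuming
    the product identifier (ASIN) is the last path segment."""
    asin = product_url.rstrip("/").rsplit("/", 1)[-1]
    return f"https://www.amazon.com/product-reviews/{asin}/"

print(review_url("https://www.amazon.com/Some-Product-Title/dp/B001OF5F1E"))
# → https://www.amazon.com/product-reviews/B001OF5F1E/
```

Real product URLs often carry extra `ref=` segments after the ASIN, so this naive last-segment split is only a starting point.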
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.
LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business
Name rolls off the tongue doesn’t it
I used to be able to search for "Trek bike derailleur hanger" and the first result would be what I wanted. Now I have to scroll past 5 ads trying to sell me a new bike, one broken link to a third party, and, if I'm really lucky, at the bottom of page 1 there will be a link to that part's page.
The shitification of the web is real.
Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1800 number with no real-person alternative to the decision tree
To get to the milk you'll have to walk by 3 rows of chips and soda.
We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.
Smaller children especially don't have a good intuition for what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident answering, but I get less and less confident every day.
The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
https://www.nytimes.com/interactive/2024/09/09/technology/ai...
https://www.nytimes.com/interactive/2024/01/19/technology/ar...
These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly, at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homo sapiens is around 300,000 years old. Cities, agriculture, and civilization are around 10,000. Metal is around 4,000. The industrial revolution is 500. Democracy? 200. Computation? 50-100.
The revolutions shorten in time, seemingly exponentially.
Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
I'm not quite sure where this is all headed.
It really isn't. Have a look at daily median income statistics for the rest of the planet:
https://ourworldindata.org/grapher/daily-median-income?tab=t...
$2.48 Eastern and Southern Africa (PIP)
$2.78 Sub-Saharan Africa (PIP)
$3.22 Western and Central Africa (PIP)
$3.72 India (rural)
$4.22 South Asia (PIP)
$4.60 India (urban)
$5.40 Indonesia (rural)
$6.54 Indonesia (urban)
$7.50 Middle East and North Africa (PIP)
$8.05 China (rural)
$10.00 East Asia and Pacific (PIP)
$11.60 Latin America and the Caribbean (PIP)
$12.52 China (urban)
And more generally: $7.75 World
I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income for at least half the population of our planet.

There is the very real possibility that everything just stalls and plateaus where we are. You know, like our population growth: it should have grown exponentially, but it did not. Actually, quite the reverse.
Flashlights? Sure, bring on aliexpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!
But seriously, it's harder to accidentally make a USB cable that fries your equipment. The more common failure mode is it fails to work, or wears out too fast. Chargers on the other hand, handle a lot of voltage, generate a lot of heat, and output to sensitive equipment. More room to mess up, and more room for mistakes to cause damage.
Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).
For some reason, your comment got me thinking of a fully automated system, like: you go to a website, pick and choose charger capabilities (ports, whether it has a battery, that sort of stuff). Then an automated factory makes you a bespoke device (software picks an appropriate shell, regulators, etc.). I bet we'll see it in our lifetimes at least.
The Technological Singularity - https://en.wikipedia.org/wiki/Technological_singularity
I don't. I mean, I can identify the bad ones, sure, but how do I know I'm not getting fooled by the good ones?
I see a lot of outrage around fake posts already. People want to believe bad things from the other tribes.
And we are going to feed them with it, endlessly.
It's relatively trivial to photoshop misinformation in a really powerful and undetectable way- but I don't see (legitimate) instances of groundbreaking news over a fake photo of the president or a CEO etc doing something nefarious. Why is AI different just because it's audio/video?
And the groundbreaking stuff isn't the problem; it's the little constant lies.
Last week a photoshopped Musk tweet was going around, people getting all up in arms against it despite the fact it was very easy to spot as a fabricated one.
People didn't care, they hate the guy, they just wanted to fuel their hate more.
The whole planet runs on fake content: magazine covers, food packaging, Instagram pics of places that never look that way...
And now, with AI, you can automate it and scale it up.
People are not ready. And in fact, they don't want to be.
Even what's free & open source in the special effects community is astonishing lately.
I'm certain you'd be shocked to see the amount of CG that's in some of your favorite movies made in the last ~10-20 years that you didn't notice because it's undetectable
But, yeah, I do think it is some kind of bias. Maybe not survivorship, though… maybe it is a generalized sort of Malmquist bias? Like the measurement is not skewed by the tendency of movies with good CGI to go away. It is skewed by the fact that bad CGI sticks out.
And it already happened, and no one pushed back while it was happening.
It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"
I don't share your confidence in identifying real people anymore.
I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.
There is a video floating around with pairs of young ladies holding "This is real"/"This is not real" signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seem a little "off" but are consistent with the small number of "influencer" videos I have exposure to.
Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.
In my opinion, the internet can be considered the equivalent of a natural environment like the Earth: it's a space where people share, meet, talk, etc.
I find it astonishing that after polluting our natural environment, we have now polluted the internet.
If we haven't already, we will be very soon. I'm sure there are people working on this problem, but I think we're hitting a very imminent feedback-loop moment. Most of humanity's recorded information is digitized, and non-human content is being generated from it at an incredible pace. We've injected a whole lot of noise into our usable data.
I don't know if the answer is more human content (I'm doing my part!) or novel generative content but this interim period is going to cause some medium-term challenges.
I like to think the LLM more-tokens-equals-better era is fading and we're getting into better use of existing data, but there's a very real inflection point we're facing.
Corporations did that, not humans.
"few people recognize that we already share our world with artificial creatures that participate as intelligent agents in our society: corporations" - https://arxiv.org/abs/1204.4116
(I initially wanted to say 'paid for by the government' but that'd be socialising losses and we've had quite enough of that in the past.)
Next token-seeking is a solved problem. Novel thinking can be solved by humans and possibly by AI soon, but adding more garbage to the data won't improve things.
Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.
But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.
>> Generative AI has polluted the data
Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.
Future models will continue to amplify certain statistical properties of their training data, and that amplified output will in turn pollute the public space from which future training data is drawn. Meanwhile, certain low-frequency data will be selected by these models less and less, becoming suppressed and possibly eliminated. We know from classic NLP techniques that low-frequency words are often among the highest in information content and descriptive power.
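A toy simulation of that amplification loop (all numbers and vocabulary invented for illustration; raising probabilities to a power with temperature below 1 stands in for a model's tendency to favor already-common tokens):

```python
# Toy model-collapse simulation: repeatedly "retrain" a word-frequency
# model on its own sharpened output. Each generation, probabilities are
# raised to 1/T with T < 1 and renormalized, mimicking a model that
# over-selects common tokens and under-selects rare ones.
def sharpen(freqs, temperature=0.8):
    powered = {w: p ** (1 / temperature) for w, p in freqs.items()}
    total = sum(powered.values())
    return {w: p / total for w, p in powered.items()}

freqs = {"common": 0.6, "typical": 0.3, "rare": 0.09, "vanishing": 0.01}
for generation in range(10):
    freqs = sharpen(freqs)

# After a handful of generations, the low-frequency words are effectively gone
# and the distribution has collapsed onto the most common token.
print({w: round(p, 4) for w, p in freqs.items()})
```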
Bitrot will continue to act as the agent of Entropy further reducing pre-AI datasets.
These feedback loops will persist, language will be ground down, neologisms will be prevented, and... society, no longer having the mental tools to describe changing circumstances, its new thoughts unable to be realized, will cease to advance and then regress.
Soon there will be no new low frequency ideas being removed from the data, only old low frequency ideas. Language's descriptive power is further eliminated and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles, language is lost, civilization is lost when we no longer know what to call its downfall.
The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.
That's why on FB I mark my own writing as AI generated, and the AI generated slop as genuine. Because what is disguised as "transparency disclaimer" is just flagging content of what's a potential dataset to train from and what isn't.
Apropos of nothing in particular, see LinkedIn now admitting [1] it is training its AI models on "all users by default"
What would it take for the OpenAI overlords to inject words into their models and will new words into usage? Few have ever had that kind of power. Through its popular GPT platform, OpenAI now has the potential to dictate the evolution of human language.
This is novel and scary.
Or we'll be fine, because inbreeding isn't actually sustainable, either economically or technologically, and to most of the world the Silicon Valley "AI" crowd is more an obnoxious gang of socially stunted and predatory weirdos than some unstoppable omnipotent force.
On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher, just 'cause; there's a lot less crap on gopher (and no, gopher is not the answer).
But...
A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.
The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).
Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.
Back in pre-LLM days, it's not like I would have hired an x264 expert to do this job for me. I would have either had to spend hours more on this task, or, more likely, this 97-year-old man would never have seen his great-granddaughter's dance, which apparently brought a massive smile to his face.
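For reference, ffmpeg's `-target` presets bundle most of the DVD constraints (MPEG-2 video, DVD frame size and bitrates, AC-3 audio) into a single flag, which sidesteps a lot of the obscure-flag archaeology. A sketch, assuming a PAL player and a hypothetical input filename:

```python
# Sketch of a DVD-compatible transcode command. "pal-dvd" assumes a PAL
# player; use "ntsc-dvd" for NTSC. The input filename is hypothetical, and
# HDR-to-SDR tone mapping is left to ffmpeg's defaults here.
cmd = [
    "ffmpeg",
    "-i", "dance_4k_hdr.mp4",  # hypothetical 4K HDR source file
    "-target", "pal-dvd",      # preset: MPEG-2, DVD resolution/bitrate, AC-3
    "out.mpg",
]
print(" ".join(cmd))  # run it with subprocess.run(cmd, check=True)
```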
Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM, or Handbrake - granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97 year-old eyes)
Everything is “seamless” nowadays. Like I am seamlessly commenting here.
Arguably, the meaning of these words evolve due to misuse too.
Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if internet humanity is already retreating into its last comely homes, which are more closed than open to the outside.
On the positive side:
1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.
2. As a result of (1), I have started to use an RSS reader again. Who would have thought!
I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.
As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
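A toy version of the perplexity heuristic (unigram model over an invented mini-corpus; a real detector would score the text under an actual LLM, but the mechanics are the same):

```python
import math
from collections import Counter

# Toy perplexity check: score a text under a smoothed unigram model built
# from a reference corpus. Suspiciously low perplexity - the model finds
# the text very predictable - is one weak signal of machine generation.
def unigram_perplexity(text, reference, smoothing=1.0):
    ref_counts = Counter(reference.lower().split())
    vocab = len(ref_counts) + 1          # +1 for unseen words
    total = sum(ref_counts.values())
    words = text.lower().split()
    log_prob = 0.0
    for w in words:
        p = (ref_counts[w] + smoothing) / (total + smoothing * vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))

reference = "the cat sat on the mat and the dog sat on the rug"
print(unigram_perplexity("the cat sat on the mat", reference))    # low: predictable
print(unigram_perplexity("quantum zebra refactoring", reference)) # high: surprising
```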
It's actually not. It's rather difficult for humans as well. We can see verbose text that is confused and call it AI, but it could just be a human as well.
To borrow an older model training method, "Generative adversarial network". If we can distinguish AI from humans... We can use it to improve AI and close the gap.
So, it becomes an arms race that constantly evolves.
But we do know that now it's a lot more, with a big LOT.
If we add linguistics to NLP I can see an argument, but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs/generative AI is the only research an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep-learning model on a lot of data?
Compare the frequency of words to those used in natural human writing and you can spot the computer from the human.
Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are.
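The frequency-comparison idea above can be sketched like this (both mini-corpora are invented stand-ins; a real baseline would come from something like wordfreq's pre-2022 data):

```python
from collections import Counter

# For each word, compare its rate in a suspect text against a human-written
# baseline. Words heavily over-represented relative to the baseline - the
# "delve" effect - bubble to the top.
def overuse_ratios(suspect, baseline, smoothing=1.0):
    s = Counter(suspect.lower().split())
    b = Counter(baseline.lower().split())
    s_total, b_total = sum(s.values()), sum(b.values())
    vocab = set(s) | set(b)
    return {
        w: ((s[w] + smoothing) / s_total) / ((b[w] + smoothing) / b_total)
        for w in vocab
    }

baseline = "we looked into the results and found a clear pattern"
suspect = "let us delve into the results and delve into the pattern"
ratios = overuse_ratios(suspect, baseline)

# "delve" stands out: frequent in the suspect text, absent from the baseline.
print(sorted(ratios.items(), key=lambda kv: -kv[1])[:3])
```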
Hmm I don’t disagree but I think it will be valuable skill going forward to write text that doesn’t read like it was written by an LLM
This is an arms race that I’m not sure we can win though. It’s almost like a GAN.
But that's a losing endeavor: if you can do that, you can immediately ask your LLM to fix its output so that it passes that test (and many others). It can introduce typos, make small errors on purpose, and anything you can think of to make it look human.
I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.
'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'
really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with by an army of employees either manually cleaning up the data, or by algorithms Google has its own LLM write for itself.
I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).
Maybe this is because I’m European, but what is partisan about calling X invariably worthless drivel? Seems a lot like facts to me considering what has been going on with the platform moderation since Elon Musk bought it. It’s so bad that the EU consider it a platform for misinformation these days.
Or potentially even more dystopian would be that AI slop would be dictating/driving human communication going forward.
So I would still state noun-phrase frequency in LLM output would tend to reflect noun-phrase frequency in training data in a similar context (disregarding enforced bias induced through RLHF and other tuning at the moment)
I'm sure there will be cross-fertilization from LLM to Human and back, but I'm not seeing the data yet that the influence on word-frequency is that outspoken.
The author seems to have some other objections to the rise of LLM's, which I fully understand.
I guess that the same way scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non-human content.
Saying magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.
Might even change the tool name.
Funny fact: it doesn't result in an increase in search results for "delve".