Unlocking Insights in Time Series Data: The Generative AI Challenge
At Global Predictions, we help self-directed investors navigate the complex world of global finance. Our flagship product, PortfolioPilot, uses big data to connect an investor's portfolio to the macro economy and make personalized recommendations to improve their net worth. To further decode the mysteries of the world around us, we launched Insights, which combs through millions of macroeconomic and financial data series to automatically generate insights that impact our users' portfolios.
We believe that by harnessing the power of machine learning we can deliver personalized, relevant, and downright fascinating macroeconomic and financial information to self-directed investors that traditionally has only been available to large institutions and hedge funds.
What makes our approach unique is that we deliver unbiased insights and, in the case of our experimental Twitter handle, @GPInvestBot, infuse them with a snarky tone. To accomplish this, we've turned to the latest breakthroughs in natural language processing and large language models (LLMs). Using these state-of-the-art technologies, we're now able to generate chart summaries that are not only data-driven and scalable but also incredibly varied and interesting.
Of course, building such a sophisticated system is no easy feat. We've encountered our fair share of challenges along the way. But we're committed to pushing the boundaries of what's possible in AI, and that's why we've launched a community-wide challenge to help us explore this exciting new frontier.
A Naïve Example
To demonstrate the current limitations of an LLM's ability to summarize time series data as text, here is a quick (and clearly naïve) example. We built a prompt that fed in the last 100 data points of a macroeconomic series, US car registration growth, and asked ChatGPT to tell us what had been going on with it recently. Example prompt below:
Write a chart headline about the following series in less than 250 characters. The chart headline should focus on the latest datapoint in the context of its history and trends and highlight why they might be important to the global macroeconomy and financial markets.
The data is United States Car Registrations Growth YoY, Ann., through Jan, released on Feb 2 by U.S. Bureau of Economic Analysis (BEA) and the series is:
For reference, a chart version of the data fed in looks like this:
The text results looked reasonable at first glance but had a lot of problems under the hood. One indicative result was the following:
US Car Registrations YoY: Jan 2021 at +1.10%, Recovering to Highest Level Since 2019
Notice that several things are quite wrong. For one, the data was never +1.10% in Jan 2021; in fact, year-over-year growth in the series was negative. Even if we look back from 2021 and assume that Jan 2021 was indeed +1.10%, there are higher values outside 2019 (for example, the turn of the year in 2014-2015 saw data above 7% for two months).
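Errors like these can be caught mechanically before a headline is ever published. The sketch below is one hedged way to do it, not a description of our production pipeline: the function names are hypothetical, and the sample values are invented placeholders (the only fact taken from above is that the Jan 2021 reading was negative). It checks whether a percentage quoted in a headline matches the latest data point, and whether a "highest since" claim holds.

```python
import re
from datetime import date


def quoted_percentages(headline):
    """Extract signed percentages such as '+1.10%' or '-3%' from a headline."""
    return [float(m) for m in re.findall(r"([+-]?\d+(?:\.\d+)?)\s*%", headline)]


def latest_value_is_quoted(series, headline, tolerance=0.05):
    """True if some percentage in the headline matches the latest observation."""
    latest_value = sorted(series)[-1][1]
    return any(abs(p - latest_value) <= tolerance
               for p in quoted_percentages(headline))


def latest_is_highest_since(series, since_year):
    """True if the latest value is the maximum of all observations
    dated in `since_year` or later."""
    ordered = sorted(series)  # (date, value) tuples sort by date first
    latest_value = ordered[-1][1]
    window = [v for d, v in ordered if d.year >= since_year]
    return bool(window) and latest_value >= max(window)


# Invented placeholder data: only the sign of the last value reflects the text.
series = [
    (date(2019, 1, 1), 2.0),
    (date(2020, 1, 1), -5.0),
    (date(2021, 1, 1), -1.3),
]
headline = ("US Car Registrations YoY: Jan 2021 at +1.10%, "
            "Recovering to Highest Level Since 2019")

print(quoted_percentages(headline))
print(latest_value_is_quoted(series, headline))
print(latest_is_highest_since(series, 2019))
```

A check like this only covers numeric claims it knows how to parse; it says nothing about tone or about whether the headline is interesting.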
Still, there is some good to be had! If you can imagine this chart headline sitting in between many clinically titled charts, this description is readable, makes an attempt to explain why it is interesting, and provides crucial language variation.
Taking the good without the bad
We’d love to see experimentation with this simple format to find best practices for “time series to text” generative descriptions. At Global Predictions we have at least one solid use-case for this sort of utility, but undoubtedly there are many more across a variety of fields too!
There are many areas of potential improvement to explore. To start off with, one might consider:
- Passing statistical information about the time series rather than the raw time series itself
- Writing a better prompt (this is a burgeoning field of research)
- Fine-tuning the LLM
- Using alternative large language models
- Generating several results and picking the best one based on some set of validation criteria
- Injecting more qualitative context about the series (for instance, is it a good or bad thing if a series goes up?) into the headline
- Playing around with tone
- Other ideas we have not thought of!
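As a starting point for the first idea in the list, here is a minimal sketch of passing summary statistics instead of the raw series. Everything in it is an assumption for illustration: the choice of statistics, the function names `summarize_series` and `build_prompt`, and the extra instruction to use only the listed numbers are ours to experiment with, not a proven recipe.

```python
from statistics import mean


def summarize_series(values, window=12):
    """Compute a few descriptive facts about a series (oldest value first)."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    latest = values[-1]
    history = values[:-1]
    # Share of earlier observations strictly below the latest value.
    percentile_rank = 100.0 * sum(v < latest for v in history) / len(history)
    return {
        "latest": latest,
        "previous": values[-2],
        "change": latest - values[-2],
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "recent_mean": mean(values[-window:]),
        "percentile_rank": percentile_rank,
    }


def build_prompt(name, stats, max_chars=250):
    """Turn the statistics into a prompt that contains no raw data points."""
    facts = "\n".join(f"- {key}: {value:.2f}" for key, value in stats.items())
    return (
        f"Write a chart headline about the following series in less than "
        f"{max_chars} characters. Only use the numbers listed below.\n"
        f"Series: {name}\n"
        f"{facts}"
    )


# Made-up values, purely to show the shape of the output.
stats = summarize_series([1.0, 2.0, 3.0, 2.5])
prompt = build_prompt("Example Series", stats)
print(prompt)
```

The resulting prompt can be sent to whichever model you are testing. Because the model never sees numbers other than the ones you computed, any figure in the headline can be checked against `stats`.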
So if you're a data whiz, a language expert, or just someone with a passion for cutting-edge tech, we invite you to join us. Together, we can unlock the full potential of large language models and revolutionize the way we understand and interact with financial data.
The Global Predictions Generative AI Challenge
We thought it would be fun and informative to engage the community in a contest, allowing everyone to experiment with tackling this sort of problem and share learnings publicly. The first Global Predictions Generative AI Challenge starts [insert date]. The challenge will run through the end of the month, and the winner receives $1,000.
The goal is to summarize macroeconomic and/or financial time series data in chart headline / tweet format automatically. The tweet should link to a public repo where we can find the source code.
We have some sample data series posted on GitHub, labeled sample_series_1.csv, sample_series_2.csv, etc. The format is a two-column CSV with the first column labeled “Date” and the second containing a description of the data as a header (which will be different for each series), then dates and float values underneath each. You can use these to build and test your process / algorithm. You can also start using these series and submit posts ahead of the contest to see how well they do, but they won’t “count” towards determining the winner or the prize money.
Each week during the contest, we will post another data series (labeled contest_series_1.csv, contest_series_2.csv, etc.), which will be linked by the @GPInvestBot Twitter account. Contestants (or anyone who wants to participate) can download the data and tweet back their automatically generated description per the submission guidelines below!
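For reference, a file in this layout can be read with the Python standard library alone. The sketch below is a minimal example under assumptions: the date format `%Y-%m-%d` is a guess (adjust `date_format` to match the actual files), `load_series` is a hypothetical helper name, and the sample rows are made up.

```python
import csv
import io
from datetime import datetime


def load_series(file_obj, date_format="%Y-%m-%d"):
    """Read a two-column CSV: a 'Date' column and a value column whose
    header is the series description. Returns (description, rows)."""
    reader = csv.reader(file_obj)
    header = next(reader)
    if len(header) != 2 or header[0].strip() != "Date":
        raise ValueError(f"unexpected header: {header!r}")
    description = header[1].strip()
    rows = []
    for row in reader:
        # Skip blank lines and rows with a missing value.
        if len(row) < 2 or not row[1].strip():
            continue
        day = datetime.strptime(row[0].strip(), date_format).date()
        rows.append((day, float(row[1])))
    return description, rows


# Made-up content standing in for a real sample file.
sample = io.StringIO(
    "Date,Example Series (made-up values)\n"
    "2020-12-01,-2.5\n"
    "2021-01-01,-1.3\n"
)
description, rows = load_series(sample)
print(description, rows)
```

With a real file, pass an open handle instead, for example `open("sample_series_1.csv", newline="")`.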
- Follow @GPInvestBot
- @GPInvestBot will tweet a chart with the hashtag #GenAIChallenge. To submit, your tweet must include:
  - Your automatically generated chart headline
  - A screenshot of your code running
  - A link to a public GitHub repo containing your code
  - @GPInvestBot and #GenAIChallenge
- One submission per person per data series, please
- That’s it!
- You must be following the @GPInvestBot account
- Headlines must be about the given time series. You can use outside data if you would like, but it must be freely available
- Headlines must be generated by a script or prompt only (we will verify this). You cannot manually edit them
- Do not expose any API keys publicly! (You can share with us privately if there is some key we need to use to run your script, but this rule is to protect you). You can assume we have an OpenAI key we can use without issue
- You must leave up your public GitHub repo throughout the course of the contest
- Headlines must be <= 250 characters. The tweet can be marginally longer because it will include a link to a public GitHub repo, plus @GPInvestBot and the #GenAIChallenge hashtag
- Headlines can be snarky, funny, etc., but please nothing offensive. We will determine what that means and disqualify any offensive material at our discretion
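If you want to check a draft against the mechanical parts of these guidelines before posting, a small script can help. This is only a sketch: `check_submission` is a hypothetical helper, the repo URL is a placeholder, and matching the handle and hashtag case-insensitively is our assumption. It cannot check the screenshot, the follow requirement, or whether a headline is offensive.

```python
import re

MAX_HEADLINE_CHARS = 250
REQUIRED_TAGS = ("@GPInvestBot", "#GenAIChallenge")


def check_submission(headline, tweet_text):
    """Return a list of problems found in a draft tweet (empty if none)."""
    problems = []
    if len(headline) > MAX_HEADLINE_CHARS:
        problems.append(
            f"headline is {len(headline)} characters, limit is {MAX_HEADLINE_CHARS}"
        )
    if headline not in tweet_text:
        problems.append("tweet does not contain the headline")
    for tag in REQUIRED_TAGS:
        if tag.lower() not in tweet_text.lower():
            problems.append(f"tweet is missing {tag}")
    if not re.search(r"https?://github\.com/\S+", tweet_text):
        problems.append("tweet is missing a GitHub repo link")
    return problems


headline = "Car registrations slip again"
tweet = (
    f"{headline} https://github.com/example-user/example-repo "
    "@GPInvestBot #GenAIChallenge"
)
print(check_submission(headline, tweet))
```

Run it on the text you intend to post; an empty list means none of the checks above failed.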
How to Win:
- New data series will be posted on Mondays at 4pm PT on the @GPInvestBot Twitter account.
- The winner will be the person whose submitted tweets have received the most likes on Twitter across all of the weekly contests they enter! Likes will be added up across datasets, rewarding processes/algorithms that are robust and consistent. We will not act as judges; however, we will validate any potential winners by running their code and evaluating whether it could have produced said result without manual interference
- The winner will be announced on the @GPInvestBot Twitter account, alongside a link to the public repo, and will receive the $1,000 prize