Ticker.history return messy tz info #208

ericwbzhang · 2023-07-19T00:55:28Z

Hi I notice the price history returns messy tz info when working with more than one ticker in different tz

Example:
tmp_ticker= yq.Ticker( ['CLK24.NYM', '6618.HK'])
a= tmp_ticker.history(interval='5m', start='2023-07-10').reset_index()

part of returns:

6618.HK,2023-07-13 15:45:00+08:00,55.50000,55.80000,55.50000,55.70000,147000.00000
6618.HK,2023-07-13 15:50:00+08:00,55.80000,55.90000,55.70000,55.75000,157050.00000
6618.HK,2023-07-13 15:55:00+08:00,55.80000,55.80000,55.55000,55.60000,86600.00000
6618.HK,2023-07-13 04:05:00-04:00,55.60000,55.70000,55.60000,55.70000,359100.00000
6618.HK,2023-07-14 09:30:00+08:00,55.70000,56.15000,55.35000,55.90000,0.00000
6618.HK,2023-07-14 09:35:00+08:00,55.70000,55.75000,55.10000,55.35000,80950.00000

pd.version =1.5.3
yq.version= 2.3.2

maread99 · 2023-07-19T10:15:26Z

Hi @ericwbzhang, thanks for raising this. The bug's here, in how the dataframes for each symbol are concatenated:

yahooquery/yahooquery/ticker.py

Lines 1324 to 1325 in cb5b2bd

    
           try: 
        
               df = pd.concat(d, names=["symbol", "date"], sort=False)

Rather than retain the timezone info of the labels being concatenated, it'll use the first label it comes across that represents the same timestamp, regardless of the timezone. In your example you'll see that the labels that seem out for the HK symbol are correct as timestamps, they are just represented in a different timezone to what would be expected. This happens wherever the label represents a timestamp that's already present in the data for the NYM symbol.

~~I'm going to offer a PR with a fix that will retain the expected timezones~~ (EDIT - no I'm not, see next comment).

It's probably worth noting that when you request data for symbols trading in different timezones AND have the adj_timezone as True (the default) the date level of the index will not be a pd.DatetimeIndex, but rather of dtype 'object'. This is because pd.DatetimeIndex requires all indices to have the same timezone. You might need to do quite a bit of post-processing. Depending on what you're doing with the data, you may find any of the following useful:

pass the adj_timezone parameter as False to return all indices in terms of UTC. The date level will then be a pd.DatetimeIndex which can be easily converted to another timezone.
have a look at the market_prices package - by default it uses yahooquery to get the price data although adds a host of options to query and post-process the data. For symbols trading on different exchanges it provides for combining all symbols in a dataframe indexed with a pd.DatetimeIndex (not a pd.MultiIndex) with unique labels in terms of your elected timezone.
(EDIT) separate out your ticker instances if it works for you to have the data in different dataframe, each with its own timezone, i.e. tick1 = yq.Ticker('CLK24.NYM'), tick2 = yq.Ticker('6618.HK').

maread99 · 2023-07-19T14:40:42Z

Right, this is annoying, I'm not able to create the required pd.MultiIndex. Tried various ways to no avail.

It might be bug in pandas (I've raised it with them here) but I think it's just as likely that it can't be done given how pd.MultiIndex work.

Assuming it's not a bug in pandas, options that occur to me are:

leave it as it is. All the labels are accurate, they just aren't all in the timezone that might be expected. (I'm not comfortable with this option given the potential for confusion.)
require that adj_timezone is explicitly passed as False whenever symbols correspond to more than one timezone (if adj_timezone is True in this case then raise an exception explaining why adj_timezone needs to be passed as False). This would force all indices to UTC and the user would know to expect this and why.

@dpguthrie, what do you reckon? I'm happy to offer up a PR for the second option if you agree that's the way to go.

As a related aside, I think the default for adj_timezone would be better as False, not least because it seems it's not possible to reliably represent more than one timezone in the returned dataframe. Although such a change would probably need to wait for any v3.

ericwbzhang · 2023-07-19T19:21:43Z

@maread99 thanks for looking into this issue. I am thinking maybe returning an extra column that states the tzinfo for each symbol if adj_timezone= True. The date col will always be utc datetime
How do you like this?

maread99 · 2023-07-19T21:32:33Z

Hi @ericwbzhang.

IMHO, and the idea I get from having followed the package's evolution, yahooquery is about providing an API to get data from Yahoo Finance and it should not be concerned with doing much more than a bare minimum of post-processing. Rather post-processing should be left to the client according to their needs (although I might be off-the-mark with this - @dpguthrie ?).

The extra column you suggest could be easily added by any user who would find it useful (a symbol's timezone is available through the API). I'm not sure if you're suggesting that this column should have dtype 'str' and show the timezone (e.g. "America/New_York") or have dtype 'object' and hold pd.Timestamp showing the row datetime in the timezone of the corresponding symbol. (The latter is certainly possible and more useful - you CAN have a column, dtype 'object', with pd.Timestamp that represent the same underlying timestamp in different timezones. Just appears you can't do this within a pd.MultiIndex.)

dpguthrie · 2023-07-19T21:38:39Z

Definitely agree that this package should be a wrapper around the API and return, for the most part, what the underlying data is and not do much more.
@maread99 Maybe we don't need a pd.MultiIndex - pretty sure people end up resetting the index anyway after getting the results.

ericwbzhang · 2023-07-19T21:59:42Z

Yeah agree with Doug. Simply retuning a df with column symbol and date would be great. Sent from my iPhoneOn Jul 19, 2023, at 17:38, Doug Guthrie ***@***.***> wrote: Definitely agree that this package should be a wrapper around the API and return, for the most part, what the underlying data is and not do much more. @maread99 Maybe we don't need a pd.MultiIndex - pretty sure people end up resetting the index anyway after getting the results. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

maread99 · 2023-07-19T22:49:28Z

Just to clarify, are we suggesting having the dataframe with index as the default integer index (first row 0) and adding a 'symbol' and a 'date' column, i.e. as currently retuned by df.reset_index()?

I think this would certainly resolve the issue. I'm happy to offer the PR and revise the doc and affected tests - let me know.

Only thing, I suspect the change would break pretty much every dependant that uses the history method. Maybe one for 2.4, even 3.0?

dpguthrie · 2023-07-22T20:20:53Z

@maread99 As always, appreciate the work and timely responses!

I've definitely gotten feedback in the past that the MultiIndex is somewhat undesirable and that most people do in fact reset_index.

It is definitely a breaking change. In my opinion, it's more of a minor breaking change (so upgrading to 2.4) instead of a major one (upgrading to 3.0). Would love to hear your opinion, and anyone else's, on what the appropriate code path and semver path should be.

ValueRaider · 2023-07-22T20:55:41Z

Definitely agree that this package should be a wrapper around the API and return, for the most part, what the underlying data is and not do much more.

Then why combine the tables at all? Just return a dict of individual tables.

It is definitely a breaking change.

My suggestion: put this change in a new major version like 3.0, but continue supporting 2.* for a good while. Let users migrate in their own time.

maread99 · 2023-07-24T09:04:54Z

Then why combine the tables at all? Just return a dict of individual tables.

I think this is a good point. At market-prices I just end up splitting the DataFrame back into individual tables. I suspect at least some users will be doing something similar. Seems to make sense to offer the rawer option and leave it to the user to decide what they want to do with it. This would also nicely solve the tz matter that started off this issue - there would be no room for conflict between 'same timestamp different timezone', rather if adj_timezone is True then each DataFrame could simply be indexed in terms of the symbol's local timezone. This would also allow the columns to cleanly represent OHLCV, i.e. as now and as I think users would most-likely expect (personally I'm not keen on polluting the columns with the likes of 'date' and 'symbol').

So there's a couple of options to resolve the tz conflicts - either as if the index were reset or offer the return as a dict[str, DataFrame] with each DataFrame representing the prices for a single symbol (as identified by the key). I'd elect the latter for reasons of being cleaner and rawer (and so potentially more performant). @dpguthrie, all that said, I'd be happy to resolve this by offering a PR for the route you'd prefer to go with.

Versioning

Whichever way this is resolved it's going to break the code of almost everyone who uses's the .history method. However, the fixes that users will be required to implement should be pretty minimal. Couple of thoughts:

Whilst the nature of the change might warrant v3, going to v3 for this alone seems a bit excessive - 'history' is after all just one method offered offered by yahooquery. Users who don't use it would see absolutely no changes despite a new major release.
Whilst going to v3 and temporarily maintaining a v2 branch might be the ideal approach to give users time to migrate, I think simply going to v2.4 could be more than justified given time restraints on owners / maintainers of open-source. Users could pin to the prior version until they get a chance to implement their fix. On this basis, might be worth considering resolving the cookies issue first, releasing as 2.4, then fix this and release as 2.5 so that users have the option to pin to a stable 2.4.
- Bundling the whole lot (cookies and this) as v3 might appear to be a clean option, although it would leave users without a stable prior version to pin to.
I think what's more important than 3.0 or 2.* is ensuring good documentation accompanies the changes / release so that users are aware there's been a breaking change and it's clear what that change is.

If it were me I'd probably go 2.4 then 2.5 - stable prior version to pin to and no responsibility to maintain a separate branch.

maread99 mentioned this issue Jul 19, 2023

BUG: in a level of MultiIndex tz info is lost from labels that represent the same timestamp pandas-dev/pandas#54194

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ticker.history return messy tz info #208

Ticker.history return messy tz info #208

ericwbzhang commented Jul 19, 2023 •

edited

Loading

maread99 commented Jul 19, 2023 •

edited

Loading

maread99 commented Jul 19, 2023

ericwbzhang commented Jul 19, 2023

maread99 commented Jul 19, 2023

dpguthrie commented Jul 19, 2023

ericwbzhang commented Jul 19, 2023 via email

maread99 commented Jul 19, 2023

dpguthrie commented Jul 22, 2023

ValueRaider commented Jul 22, 2023

maread99 commented Jul 24, 2023 •

edited

Loading

Ticker.history return messy tz info #208

Ticker.history return messy tz info #208

Comments

ericwbzhang commented Jul 19, 2023 • edited Loading

maread99 commented Jul 19, 2023 • edited Loading

maread99 commented Jul 19, 2023

ericwbzhang commented Jul 19, 2023

maread99 commented Jul 19, 2023

dpguthrie commented Jul 19, 2023

ericwbzhang commented Jul 19, 2023 via email

maread99 commented Jul 19, 2023

dpguthrie commented Jul 22, 2023

ValueRaider commented Jul 22, 2023

maread99 commented Jul 24, 2023 • edited Loading

Versioning

ericwbzhang commented Jul 19, 2023 •

edited

Loading

maread99 commented Jul 19, 2023 •

edited

Loading

maread99 commented Jul 24, 2023 •

edited

Loading