Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OHLC returns wrong/duplicate columns #305

Open
nocodehummel opened this issue May 18, 2020 · 11 comments
Open

OHLC returns wrong/duplicate columns #305

nocodehummel opened this issue May 18, 2020 · 11 comments

Comments

@nocodehummel
Copy link

Description

The issue occurs when multiple columns in the data set match the OHLC-names. The OHLC function returns each matching column. In the test case the data set contains a column from Stochastic oscillator (TTR function) with the name "slowD". As a result OHLC returns; "AAPL.Open", "AAPL.High", "AAPL.Low", "slowD", "AAPL.Close" (from a data set with 20 columns).

Expected behavior

Expect the OHLC function to return the Open, High, Low and Close data, nothing else.

Minimal, reproducible example

aapl <- quantmod::getSymbols("AAPL", from = "2019-01-01", auto.assign = FALSE)
stoch <- TTR::stoch(aapl[, c("AAPL.Close", "AAPL.Low", "AAPL.Close")])
x <- merge(aapl, stoch)
colnames(quantmod::OHLC(x))
@nocodehummel
Copy link
Author

In the source the "Low" column is found by grep('Low',colnames(x),ignore.case=TRUE).

Possible solution(s) can be to first attempt more strict matches, alternative to always return the first result grep('Low',colnames(x),ignore.case=TRUE)[1].

@joshuaulrich
Copy link
Owner

This is similar to #24. Have you thought through the discussion there?

@nocodehummel
Copy link
Author

Hi, sorry I only checked against the open issues.
Your suggestion to implement with test coverage makes sense.
Would it be okay to use testthat package (current test are not using it)?
If you are open to implement the improvement, I can prepare some tests.

@joshuaulrich
Copy link
Owner

No worries! I only wanted you to be aware of the history, so you could take it into account while thinking about the issue.

It would be great if you could write some tests, especially if you can think of any other edge cases that aren't mentioned in the other issue.

You can write them in the testthat style and/or use testthat locally. But I would use tinytest instead, so don't spend time getting the package to build/check with testthat.

@nocodehummel
Copy link
Author

In preparing test cases I do not get a positive result from has.Op using an object returned by getSymbols with column name changed to "Op". Given first three lines I would expect a TRUE (see below). In general attr(x, "some_attribute") returns NULL while attributes show the specific attribute. Btw is has.Op inconsistent with Op in this perspective? What is the expected behaviour?

`has.Op` <-
function(x,which=FALSE)
{
  colAttr <- attr(x, "Op")
  if(!is.null(colAttr))
    return(if(which) colAttr else TRUE)

  loc <- grep('Open',colnames(x),ignore.case=TRUE)
  if(!identical(loc,integer(0))) {
    return(if(which) loc else TRUE)
  } else FALSE
}

@nocodehummel
Copy link
Author

nocodehummel commented May 19, 2020

Based upon the test cases analysed from #24 and #305, the following solution should work.

`has.Op` <-
function(x,which=FALSE)
{
  cols <- colnames(x)
  n <- which(cols %in% c("Op", "Open"))
  
  if(identical(n,integer(0))) {
    n <- grep('Open$',cols, ignore.case = TRUE)
  }
  
  return(if(which) n else !identical(n,integer(0)))
}

And below solution for the related function (avoiding code duplication).

`Op` <-
function(x)
{
  col <- has.Op(x, which = TRUE)
  if(!identical(col, integer(0))) {
    return(x[, col])
  } else {
    stop('subscript out of bounds: open column not found')
  }
}

If you think this makes sense I can prepare a pull request tomorrow with the changes for OHLC functions and including tinytest cases.

@nocodehummel
Copy link
Author

Created pull request #306 to improve the column matching (with test cases).

@joshuaulrich
Copy link
Owner

Created pull request #306 to improve the column matching (with test cases).

I'm sorry I didn't respond earlier. I started working on the comments below on Saturday morning (which was still later than I should have). I'm going to leave these comments here, and I'll add the relevant comments to the PR.

Please keep in mind that I had most of the points below in mind before you created your PR. They're based on the code in your comments in this issue, so some points might not apply to your PR.


If you think this makes sense I can prepare a pull request tomorrow with the changes for OHLC functions and including tinytest cases.

I would appreciate a PR! Please take a look at the contributing guide before you get started.

Some notes to keep in mind:

  1. Please do not remove the column attribute handling in the has.* functions. That functionality has been there since 2011. Removing it could break existing code.
  2. n <- which(cols %in% c("Op", "Open")) will usually be integer(0). I don't feel good about that code being on the happy path when it's a special case. Not for performance reasons, but because it makes the function intent less clear.
  3. It's extremely helpful to have tests before starting to refactor code. Hopefully, the code already has tests, but that wasn't true in this case. So I would have written tests on the existing code first, to make sure I didn't break things when I was refactoring.

@nocodehummel
Copy link
Author

Thanks for your feedback. I will add the response to the PR #306.

@gcpoole
Copy link

gcpoole commented Jun 9, 2020

I have run into this bug too....

Have you considered using a REGEX for your pattern matching in grep? You could return "T" for any instance of "low" UNLESS it is followed by a period. For instance:

grepl(
  "low($|[^.]).*", 
  c("LOW.High", "LOW.Low", "Low", "low", "MSFT.Low", "slow", "Lower."),
  ignore.case = T
)

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

In the pattern low($|[^.]).*, low means match the letters l, o, w, in order.
($|[^.]) means then match the end of the string (the '$") or any character except a period (^ within [ ] means match anything other the characters following ^ within the [ ]). The . outside of [ ] means match any character at all and the * means zero of more times. So basically this tells the parser to "match 'low' followed by either the end of the string or as many other characters as you like, but just not a period.

I can think of one instance when this solution would fail: a ticker containing "LOW" followed by another letter. For instance, if there were a ticker "FLOWR", the bug would re-emerge.

The only other way I can see to solve this would be to match only columns that end in "low," which seems like a reasonable permanent solution. To do that, use "low$" as your grep pattern:

grepl(
  "low($|[^.]).*", 
  c("LOW.High", "LOW.Low", "Low", "low", "MSFT.Low", "slow", "Lower."),
  ignore.case = T
)

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Perhaps implement the first solution now and warn users that in a future version, the column names will need to end with the key search words.

@nocodehummel
Copy link
Author

Thanks for your feedback. The proposed solution in the pull request is to catch exact matches first and secondly perform a regex with 'grep('Low$', colnames(x), ignore.case = TRUE)'. Similar to your second option. The first option fails for example on the symbol of this ETF.

The proposed solution works with how the columns are added in other quantmod functions, and by being more strict reduces the risk of interference with other packages/code. The downside is a risk to break code that people may have implemented based on the current matching.

@joshuaulrich joshuaulrich changed the title OHCL returns wrong/duplicate columns OHLC returns wrong/duplicate columns Apr 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants