
After losing verdict, Toyota settles in sudden acceleration case

Discussion in 'Prius, Hybrid, EV and Alt-Fuel News' started by bwilson4web, Oct 27, 2013.

  1. walter Lee

    walter Lee Hypermiling Padawan

    Joined:
    Oct 26, 2009
    1,126
    376
    5
    Location:
    Maryland
    Vehicle:
    2010 Prius
    Model:
    III
    Having multiple error bits flagged on the ECU device control block suggests a cascading system error: a primary error trips several secondary errors, which in turn trip several additional errors, and so forth. Without time-sequence stamps, it is difficult to track down the original primary error in a cascade unless one has the source code, which tells the test engineer the interrupt priority (the error-trapping sequence/hierarchy). Finding the primary error in a cascading error situation is like the proverbial "finding a needle in a haystack."

    Embedded computer systems like those found in motor vehicles are often written in machine code because of memory restrictions and the need for high operating speeds at low power levels. Custom embedded systems usually have custom ROM built in to hold the software, so even with the correct disassembler to read the code, understanding it can be problematic unless you know the *entire* memory map of where all the embedded ROM locations are. For example, machine code writing to a memory location may look totally useless until you realize it is writing to a ROM location to test whether it has been illegally copied.
     
  2. FL_Prius_Driver

    FL_Prius_Driver Senior Member

    Joined:
    Jun 17, 2007
    4,319
    1,527
    0
    Location:
    Tampa Bay
    Vehicle:
    2010 Prius
    Model:
    I
    My main day job is overseeing the design of manned and unmanned flight control electronics. I've watched how a poorly designed flight control system had an SEU (single-event upset, from a cosmic ray) completely destroy a UAV in under a second. I'm talking huge fireball, not a slight glitch. The entire engineering approach to these flight safety systems requires a design culture very, very different from the approach Toyota and every other car maker take to vehicle electronics. kbeck gave a pretty good description of what is done at the low engineering level, so I'll talk about the high-level company culture/approach that ultra safe/reliable flight controls require:

    1) Avionics starts with a nationally or internationally recognized standard for developing system design, hardware design, and software design. Basically, the organization has to set, meet, and prove they meet the safety and reliability standards that industry avionics experts have proven to be valid and achievable. These are demanding. In the avionics world, this translates into three or more independent units calculating the flight control algorithm and a selection architecture ensuring all agree. When they do not agree, the odd unit out is ignored and must be fixed before flying again. When a flight control unit like this fails, it better be due to a meteorite hitting it or a lot of planes will be grounded quick.

    2) The three independent units must be proven to not have "common faults". An example would be a software bug that would simultaneously occur in all three units if they were running the same software. So of the three or more units, each should have a completely different real-time operating system, the software for each would be written by independent teams, and three different compilers would be used to compile the source code into executables. Then three different processors would be desirable. In reality, this may not be possible (there are not that many aviation processor chip makers) so there is some streamlining, but the intent can be accomplished if the design teams are disciplined.

    3) The logging and data storage of all the processing data is an integral part of the design. When an SEU occurs, exactly what happened and how it was handled must be recorded completely. At least in the fireball described above, the point where the SEU inverted the flight control algorithm was located exactly thanks to real-time telemetry, so why the UAV turned into a pile of ashes was determined.
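    The "selection architecture" in point 1) can be sketched roughly as below. This is only an illustration of the idea, assuming a simple mid-value select with a tolerance check; the names `vote`, `NoMajorityError` and the `tolerance` parameter are made up for the example and do not come from any avionics standard or real flight control unit.

```python
from statistics import median


class NoMajorityError(Exception):
    """Raised when fewer than a majority of channels agree (hypothetical name)."""


def vote(outputs, tolerance=0.01):
    """Pick the mid value of redundant channel outputs and flag the odd ones out.

    Returns (voted_value, faulty_channel_indices).
    """
    mid = median(outputs)
    # A channel is "odd" if it strays from the mid value by more than the tolerance.
    faulty = [i for i, v in enumerate(outputs) if abs(v - mid) > tolerance]
    agreeing = len(outputs) - len(faulty)
    if agreeing < len(outputs) // 2 + 1:
        raise NoMajorityError(f"only {agreeing} of {len(outputs)} channels agree")
    return mid, faulty


# Channel 2 has had its sign inverted; the other two still agree.
value, faulty = vote([10.0, 10.0, -10.0])
print(value, faulty)
```

    The flagged channel is ignored for the current command, and in the scheme described above it would also be logged and repaired before the next flight.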

    I see none of these lessons applied in any car ECU. (However, Elon Musk has figured out just how critical ECU recording is, so at least one car maker has a clue.)
     
    kbeck and austingreen like this.
  3. kbeck

    kbeck Active Member

    Joined:
    Feb 10, 2010
    420
    275
    0
    Location:
    Metuchen, NJ
    Vehicle:
    2010 Prius
    Model:
    III
    So, let me make up off the top of my head why zillions of Toyotas aren't accelerating like crazy all over the landscape:
    1. If it was that obvious and common, it wouldn't have made it out of R&D over at Toyota, or out of the testing platform. (Unless it was a one-off that wasn't replicated. Those sometimes get "smoothed" over. It's hard to troubleshoot something that's not there.)
    2. Think about my Cosmic Ray hypothesis which, as I said before, I wasn't kidding about. We're talking about Total Electronic Disruption a couple of micrometers across and a couple of miles long at random angles, at random times, and not that high a density. (If it was a high enough density to be really obvious, we'd all be dead of radiation poisoning. As it is, cosmic rays explain why, for example, it takes ~5 pregnancies before a viable fetus is formed, the rest all spontaneously aborting - defects in the genome caused in large part by cosmic rays kind of kill the cell.) So, once in a great while, a Cosmic Ray comes down and hits the right bit at the right time. And the strength of the Ray and the strength of the RAM cell (or whatever) have to be compatible. For all we know, there may be Toyotas out there that, when driven through a veritable Cosmic Ray storm, show no ill effects, and others that stop working the moment the sun burps. In any case, given enough Toyotas out there, the laws of random numbers are such that it would be a near certainty that some car, somewhere, is going to get zapped eventually.
    So: Random events. Highly unlikely. But, given the software, these things will occur.
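    The "given enough Toyotas" argument is the complement rule: if each car independently has a tiny probability p of an upset, the chance that at least one of n cars sees one is 1 - (1 - p)^n. A small sketch follows; the per-car probability in it is a made-up illustrative number, not a measured upset rate, and the function name is hypothetical.

```python
import math


def p_at_least_one(p_per_car, n_cars):
    """P(at least one event among n independent cars) = 1 - (1 - p)^n.

    log1p/expm1 keep the result accurate when p is very small.
    """
    return -math.expm1(n_cars * math.log1p(-p_per_car))


# ASSUMPTION: 1e-7 is an invented per-car probability, chosen only to show the trend.
p = 1e-7
for n in (1, 10_000, 10_000_000):
    print(f"{n:>10} cars: {p_at_least_one(p, n):.6f}")
```

    Whatever the real p is, the trend is the same: negligible for any one car, approaching certainty across a large enough fleet.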

    KBeck
     
  4. jdcollins5

    jdcollins5 Senior Member

    Joined:
    Aug 30, 2009
    5,131
    1,340
    0
    Location:
    Wilmington, NC
    Vehicle:
    2010 Prius
    Model:
    III
    ^Is that really the best answer you can give?
     
  5. Mike500

    Mike500 Senior Member

    Joined:
    Mar 1, 2012
    2,593
    764
    0
    Vehicle:
    2012 Prius v wagon
    Model:
    Two
    It's like the lottery, and the odds are even greater.

    It can even be nearly impossible to reproduce.
     
  6. FL_Prius_Driver

    FL_Prius_Driver Senior Member

    Keep in mind that most drivers are extremely competent and handle it well. A lot of transient problems are totally resolved by a restart or reboot. Nearly all computers depend on this for crash recovery. I would rather have a software event like this any day over a tire blowing out while on the interstate. Likewise, most crashes are due to the driver doing something very wrong, with blaming the car as a routine excuse. So what the actual situation is can be mind-numbingly hard to conclusively determine.

    To truly track down and determine how many events are due to internal malfunctions requires very good data logging internals. To the extent that Toyota, or any car maker, forgoes intensive logging of error conditions and events, they do hold responsibility. They must provide the hard data showing what the ECU commands were during a critical event. Once they start doing that, big changes would be forthcoming....the first of which would be an international standard for ECU reliability.
     
    jdcollins5 likes this.
  7. a_gray_prius

    a_gray_prius Rare Non-Old-Blowhard Priuschat Member

    Joined:
    Jun 13, 2008
    2,927
    782
    0
    Location:
    IL
    Vehicle:
    2008 Prius
    Model:
    N/A
    This is demonstrably false. There are repair mechanisms in all cells which repair many, many defects in the genome and checkpoints that prevent cell division until these defects are fixed (although some beyond repair will undergo apoptosis). However, these processes don't always work. It's clear that they cleaned out all the "shallow bugs" but the deep, rare ones are always going to be hard to find in any software.
     
  8. Mike500

    Mike500 Senior Member

    Then again, the bug might NEVER surface throughout the entire life of the program.
     
  9. Whirldy

    Whirldy Junior Member

    Joined:
    Aug 5, 2013
    89
    15
    0
    Vehicle:
    2013 Prius c
    Model:
    One
    +1 «As the thinker thinks, the prover proves.»
     
  10. fuzzy1

    fuzzy1 Senior Member

    Joined:
    Feb 26, 2009
    17,557
    10,324
    90
    Location:
    Western Washington
    Vehicle:
    Other Hybrid
    Model:
    N/A
    I'm not aware of any minimum (non-zero) occurrence rate. Various bug occurrence rates should fill the entire spectrum between 'too fast to count' and 'so rare it probably won't happen even once'. The frequent ones will get far more attention and debugging effort than the extremely rare ones.

    A certain type of electrical hardware error can theoretically be reduced to any arbitrary finite level above zero, but not to zero itself. In an imaging system where it just causes sparkles on a video display, it needs only to be kept low enough to not be much of a distraction. But in other uses where it can upset a system, it can theoretically be pushed down to once per second, once per month, once per warranty period, once per device lifetime, or even to once for the entire product build over the age (so far) of the Universe. At a cost, of course. But it cannot be completely eliminated, even in the absence of cosmic rays.

    So just because an alleged bug doesn't materialize at a significant rate doesn't mean it must be just a figment of some lawyer's imagination. But I also suspect that when the extremely rare bug does happen, it most likely will get lost under the flood of 'pilot errors', not landing on the desk of the ambulance chaser who knows about it with a convincing client available.
     
  11. bwilson4web

    bwilson4web BMW i3 and Model 3

    Joined:
    Nov 25, 2005
    27,663
    15,662
    0
    Location:
    Huntsville AL
    Vehicle:
    2018 Tesla Model 3
    Model:
    Prime Plus
    Perhaps "never" might be a little strong:
    Our significant Gen III problem was the brake system:
    • Software update recall - resolved or diminished the "brake pause" problem.
    • Accumulator replacement - the first 80,000 apparently had a defective, metal bellows replaced.
    The rule of thumb I follow with intermittent problems is how they are like the sith:
    Source: Yoda, Episode I: The Phantom Menace

    A major intermittent problem often provides cover for a second, less frequent one.

    We're also seeing reports of Gen III in taxi service having higher failure rates: traction battery, inverter, power steering, and transaxle. Sad to say, we have not found a Gen III taxi driver willing to collaborate on getting some metrics.

    As for software bugs, my other rule of thumb is that moving a body of software from one computer system or language to another will often expose previously hidden errors. It can be as simple as a language change or changing processors and OS. So this weekend, a latent defect in a program was found by integration and test of the Perl code on Solaris 10 and Redhat. Code that appeared to work on one failed on the other, but it was a legitimate bug. Just one system did not exhibit the failure symptom.

    Bob Wilson
     
  12. kbeck

    kbeck Active Member

    Here's a better one: I spent a good chunk of last night going through the couple-hundred-page in-court transcript of Mr. Barr's testimony in front of the jury. Barr Testimony. That includes his testimony elicited by the friendly plaintiff lawyer and the cross-examination by the not-so-friendly defendant lawyer. He blew the latter away.

    My hair is on fire. Forget cosmic ray events for the moment:
    1. Stack was 91% full, semi-worst case, with recursive functions present but not called. No stack overflow check code. And the OS code/data area that controls which subroutines run is located just above the stack. Holy *****. Analysis showed that if the stack overflowed, the main "X" (they called it that) subroutine would stop, period. Toyota made a goof-ball error on Day 1 and did not account for the OS stack usage. (What!?!)
    2. They had fun stopping this routine with a car on a dynamometer. Car runaway resulted and, initially, scared the heck out of the tester.
    3. The watchdog timer function was called abysmal, and this was not challenged by Toyota. From my reading of the unchallenged testimony, the primary reason it didn't work was that Toyota had overloaded this processor with too much to do. As a result, functions that might have safely reset the ECU were not present - had they been present, the ECU wouldn't have been shippable.
    4. Error codes from the OS were ignored. (Safety critical function - they're ignoring error codes!!!! @#$!%)
    5. DTC codes for problems with the throttle control were set by the same routine that controls the throttle - so, when the throttle process stopped, no DTC codes were saved. Worse - analysis and testing proved that, even when DTC codes should have been present in some cases (in the airbag system), they weren't. Bugs.
    6. The safety CPU that was supposed to check on the main CPU was faulty. On the one fault that was exercised by the Barr group, said CPU inadvertently (i.e., not by design) detected the fault. And, in this one case, could stall the engine - but only if the driver came all the way off the brake, then reapplied the brake, then waited a few seconds. Urghgh.
    7. No MR system at Toyota. Period. I cannot overstate on just how flat-out evil this is when running a software development operation. This could easily have been challenged by Toyota's lawyer, but was not.
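    For the non-embedded folks, items 3 and 5 come down to one idea: a watchdog only helps if it notices that a specific task has died. One common pattern is to kick the hardware timer only when every monitored task has checked in since the last kick. The sketch below simulates that pattern; it is not Toyota's design and not something taken from the testimony, and the class, method and task names are all hypothetical.

```python
class TaskWatchdog:
    """Simulated watchdog that requires a check-in from every monitored task."""

    def __init__(self, task_names):
        self._alive = {name: False for name in task_names}
        self.resets = 0  # stand-in counter for hardware resets of the ECU

    def check_in(self, name):
        """Called by a task each time it completes a healthy cycle."""
        self._alive[name] = True

    def service(self):
        """Run once per watchdog period; returns True if the timer was kicked."""
        healthy = all(self._alive.values())
        if not healthy:
            # On real hardware: skip the kick and let the timer expire and reset.
            self.resets += 1
        for name in self._alive:
            self._alive[name] = False  # every task must check in again
        return healthy


wd = TaskWatchdog(["throttle", "diagnostics"])
wd.check_in("throttle")
wd.check_in("diagnostics")
print(wd.service())   # both tasks alive this period
wd.check_in("diagnostics")
print(wd.service())   # throttle task silent this period
```

    The point of the design is that the check lives outside the task being monitored, so a dead throttle task cannot also silence its own fault report.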
    There's other fun stuff in there. One of the worst: NASA asked Toyota if the CPU had ECC (error-correcting code) memory. Toyota responded with a "yes". Toyota lied. As a result, NASA did not analyze a whole slew of possible failure modes. On the report, eventually made public, Toyota redacted every mention of ECC.

    At the time, ECC processors were more expensive than non-ECC processors. But ECC processors are much, much better at detecting cosmic ray events. Urgh. Double Urgh. My hair is on fire.
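    To show what ECC buys you, here is a textbook Hamming(7,4) code, which stores 4 data bits in 7 and can locate and correct any single flipped bit. This is only a toy illustration: real ECC memory uses wider codes (typically with double-error detection as well), and nothing here reflects the actual hardware in any Toyota ECU. The function names are made up for the example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # the syndrome is the 1-based position of the flipped bit
    if pos:
        c[pos - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], pos


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a cosmic-ray hit on position 5
print(hamming74_decode(word))
```

    The decoder reports which bit was hit as well as fixing it, which is exactly the kind of event a logging system would want to record.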

    I take back my comment from before that Barr had been doing cut-and-paste in his 800 page report. From the sounds of it, he didn't need to - there were that many errors.

    By the way: Barr was editor-in-chief of Embedded Systems Programming magazine, has written three books on the subject, and spent 1.5 years pulling teeth out of Toyota's code, with three to six helpers.

    It may take me a day or three to calm down. Read the testimony.

    KBeck

    Fine. I agree that there are repair mechanisms. I'll agree that they fix a large percentage of errors introduced by $RANDOM problem, be it free radicals, cosmic rays, or whatever.

    But I'll also posit that they don't fix everything. If these processes don't fix everything then there are defects that remain. The implication is that if the uncorrected damage happens in the ovaries/testes then that damage will be present in the genome of an embryo.

    As it happens the stuff I use around here is heavy on error correcting codes. ECC codes can reduce the error rate by orders of magnitude - but they can't make it zero, very similar in overall function to Life.

    There are examples of highly rad-hard bacteria with multiple copies of their own genome and genome correction processes that make what humans (and most of the rest of DNA/RNA based life) do look silly. Said bacteria have few competitors in their environment, for obvious reasons. Put said bacteria in a nicer environment and they get out-competed by other bacteria that don't have to put out the same effort to just stay alive. Likewise, our correction processes are Good Enough - but not as good as said bacteria. Hence, mutations (rarely) and cell death (a lot more likely). I've read in multiple places that it takes, on average, five attempts to get a viable embryo, and it's DNA defects that kill off the misses.

    KBeck
     
    bwilson4web likes this.
  13. hill

    hill High Fiber Member

    Joined:
    Jun 23, 2005
    20,174
    8,353
    54
    Location:
    Montana & Nashville, TN
    Vehicle:
    2018 Chevy Volt
    Model:
    Premium
  14. bwilson4web

    bwilson4web BMW i3 and Model 3

    Your summary is enough to get my hair 'smoldering' and I haven't read the PDF, yet. Being an old guy, I'll print it up later tonight. I don't like to tie up a printer during the business day.

    You've done good!!

    Anything on the full report?

    Thanks,
    Bob Wilson
     
  15. austingreen

    austingreen Senior Member

    Joined:
    Nov 3, 2009
    13,602
    4,136
    0
    Location:
    Austin, TX, USA
    Vehicle:
    2018 Tesla Model 3
    Model:
    N/A
    We did find out during the congressional testimony that Toyota did not actually do the type of testing some of us do on mission critical software.

    Lines of code is a bad metric. A good metric would be no logged incidents. Toyota until very recently refused to properly log information and to read it. We therefore should have severe doubts about any investigations with black boxes that do not record properly or are not read. IIRC all 2012 and later Toyotas have proper logging, and we can see if we get incidents. I am absolutely sure the software has changed, which means this does not validate the software in older cars.
     
  16. kbeck

    kbeck Active Member

    Read the testimony.
    1. Barr's group has proof that the logging doesn't actually work in all cases.
    2. The main, "X" process, which controls throttle position, was nicknamed by his group the "kitchen sink" process, in that a ton of stuff was in there - including sending DTC codes to the black box. (Not present on the 2005 Camry that was the subject of the lawsuit, but was present on the 2008 Camry code that he also inspected.)
    3. A single bit flip of a particular unprotected RAM location could, can, and apparently has, in the field, stopped the "X" process in its tracks. At which point, on these Camrys:
      1. The throttle is stuck in whatever position it was in before the bit was flipped. If one was accelerating, one keeps on accelerating.
      2. The badly designed watchdog does not figure out that the blame throttle control process has died.
      3. The badly designed supervisory CPU does not figure out that the blame throttle control process has died, or, for that matter, even pay attention to the fact that the throttle is open and the driver is trying to brake the car.
      4. The bit in question is just above the stack space, so a stack overflow would land on it. An overflow is very possible, given that Toyota badly underestimated how much stack space they needed and, additionally, used recursive functions that fill up stack space dramatically.
      5. There was no ECC RAM, which would have detected such a cosmic ray bit flip and led to an ECU reset. (ECC RAM is present in later versions of the Camry.) However, given that Toyota was not mirroring OS variables (of which that bit was one), ECC RAM would not have helped in the case of a functional stack overflow condition.
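    "Mirroring" a critical variable usually means keeping a second copy (often bit-inverted) and comparing the two on every read, so that a stray write or a bit flip in one copy is caught before the value is used. A minimal sketch, assuming an 8-bit variable and an inverted mirror; the class and exception names are invented for illustration and are not taken from Toyota's code or the testimony.

```python
MASK = 0xFF  # assume an 8-bit variable for the example


class CorruptionError(Exception):
    """Raised when the primary copy and its mirror disagree (hypothetical name)."""


class MirroredByte:
    """Stores a value alongside its bitwise complement and checks both on read."""

    def __init__(self, value):
        self.write(value)

    def write(self, value):
        self.primary = value & MASK
        self.mirror = ~value & MASK  # inverted copy, ideally kept in a different RAM area

    def read(self):
        # A healthy pair XORs to all ones; anything else means one copy was corrupted.
        if self.primary ^ self.mirror != MASK:
            raise CorruptionError("primary and mirror disagree")
        return self.primary


task_enable = MirroredByte(0b00000101)
print(task_enable.read())
task_enable.primary ^= 0b00000100  # simulate a bit flip or stack overflow hitting one copy
try:
    task_enable.read()
except CorruptionError as exc:
    print("detected:", exc)
```

    Unlike ECC, this also catches software overwriting one copy, which is why it matters for the stack overflow case and not just for cosmic rays.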
    It's not the lines of code metric. Barr's group ran other tools to determine the complexity and testability of the code. Of the various piles of code they examined, a large number dinged up as "untestable" and a smaller, but critical and very scary pile dinged up as "unmaintainable".

    And what has my hair on fire is that there was no MR system for tracking errors.

    There is no, I repeat, no excuse for software development practices as listed above on man-safety critical hardware and software. This was simply pure, organization-driven massive stupidity. Toyota deserves every bit of the bitch-slapping they are about to get from other injured, maimed, and the estates of dead plaintiffs. And the people who allowed this to occur should be fired - but I don't expect that that will happen in insular Japan.

    If you guys want to see some real furor, skip the EE Times article and go to EDN magazine. There's a lot more coverage and some really pissed-off embedded systems programmers hanging out there.

    KBeck
     
  17. austingreen

    austingreen Senior Member

    Please re-read what I wrote. I said that in 2012 they finally started logging well; we know that they likely logged badly on purpose in the mid 2000s. That came out in the congressional testimony.

    I know your hair is on fire, I'm not trying to put it out;-) but explain to the non-technical people out there that lines of code are a bad metric. Poor logging is a symptom of likely bugs in the code. You pointed out analysis of sloppy programming, also an area that could likely cause problems. Do we know if these practices led to any of the fatalities? No! But they certainly would leave doubt in my mind that Toyota was truthful, and lead to findings against the company.
     
  18. FL_Prius_Driver

    FL_Prius_Driver Senior Member

    First, thanks for the explicit details. Discussing facts is so much more productive than discussing opinions.

    The engineering culture of Japan is almost entirely inverted from the engineering culture of the US of A. Specifically, for most of my career, "manufacturing" engineering is considered what you do if you cannot do "real" engineering. As a telling example, I interviewed a whole series of HW engineers and ranked them. The HR department in turn passed the resumes of the top third to the RF engineering department, the middle third to the Digital Engineering department, and the bottom third to the Manufacturing department. It's total garbage, but very hard to change.

    Meanwhile in Japan, Manufacturing Engineering is considered the top of the pyramid and software engineering is at the bottom. Don't expect any Software powerhouses to come out of Japan. Toyota is probably the extreme of this culture since they view themselves as a manufacturing company. This is strongly based on the legacy of Taiichi Ohno. Following in his footsteps is the Japanese engineering ideal there. As for a Japanese software legend, the pickings are very slim.
     
    Chazz8 likes this.
  19. fuzzy1

    fuzzy1 Senior Member

    I haven't yet been able to read the testimony, but the discussion in this thread still leaves a glaring omission.

    Alleged SUA victims were claiming that the service brake failed. The NASA report stated:
    Has this changed? If the electronic throttle control fails in a WOT state (or other bad condition) the way the old mechanical controls could, is there a path that could disable the service brake? Or can the transmission be stuck in Drive, unable to shift to Neutral, with enough engine power to override the brake?

    Sure, the throttle control firmware needs much better design. But my hair isn't going to be set on fire if the brakes still work as intended, especially when the faulty electronic controls cause fewer engine runaways than did the old mechanical controls.
     
  20. bwilson4web

    bwilson4web BMW i3 and Model 3



    I am only half-way through the testimony and wanted to share some progress and early impressions:
    • Barr's testimony alone would make an excellent introduction to the art of programming. I'm looking forward to buying his book.
    • He introduces, explains, various bugs and reports they were 'found in the code' but not demonstrated.
      • Perhaps static code analysis has advanced far enough that this now works. My past experience with code testing software (40 years ago?) was not positive.
      • His testimony would be more compelling IF he said 'we replicated this problem in 1-2 cases' with the debugger.
    • The first demonstration with the Camry showed the effect of "task x" being manually killed. The symptoms and the mitigation details about 'riding the brake' were impressive.
      • I am bothered that they manually killed the task. I'll be looking for a description of how task failure could be replicated in operation.
      • Implicit in the testimony is both the 'watchdog' and monitor processor failed to detect the task x failure and mitigate the loss. I look forward to how these were tested.
    Like I mentioned, static code tools and techniques 40 years ago were unimpressive. So I prefer seeing faults replicated or detected using a debugger or other tool. Still, I quite agree with your recommendation and found it good, not as perfect as I might want but good enough:
    I will probably get his book.

    Bob Wilson

    ps. The code testing software I remember seemed to be a lot of 'smoke' but not so much heat. I like using compiler-based flags, but 3rd-party code analyzers, well, I probably need to do some research. Now if they could just fix the famous web site . . .